<!DOCTYPE html>

<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" />

    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <meta http-equiv="x-ua-compatible" content="ie=edge">
    
    <title>6.4. Multi-Channel Recall &#8212; FunRec Recommender Systems 0.0.1 documentation</title>

    <link rel="stylesheet" href="../_static/material-design-lite-1.3.0/material.blue-deep_orange.min.css" type="text/css" />
    <link rel="stylesheet" href="../_static/sphinx_materialdesign_theme.css" type="text/css" />
    <link rel="stylesheet" href="../_static/fontawesome/all.css" type="text/css" />
    <link rel="stylesheet" href="../_static/fonts.css" type="text/css" />
    <link rel="stylesheet" type="text/css" href="../_static/pygments.css" />
    <link rel="stylesheet" type="text/css" href="../_static/basic.css" />
    <link rel="stylesheet" type="text/css" href="../_static/d2l.css" />
    <script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
    <script src="../_static/jquery.js"></script>
    <script src="../_static/underscore.js"></script>
    <script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
    <script src="../_static/doctools.js"></script>
    <script src="../_static/sphinx_highlight.js"></script>
    <script src="../_static/d2l.js"></script>
    <link rel="index" title="Index" href="../genindex.html" />
    <link rel="search" title="Search" href="../search.html" />
    <link rel="next" title="6.5. Feature Engineering" href="5.feature_engineering.html" />
    <link rel="prev" title="6.3. Data Analysis" href="3.analysis.html" /> 
  </head>
<body>
    <div class="mdl-layout mdl-js-layout mdl-layout--fixed-header mdl-layout--fixed-drawer"><header class="mdl-layout__header mdl-layout__header--waterfall ">
    <div class="mdl-layout__header-row">
        
        <nav class="mdl-navigation breadcrumb">
            <a class="mdl-navigation__link" href="index.html"><span class="section-number">6. </span>Project Practice</a><i class="material-icons">navigate_next</i>
            <a class="mdl-navigation__link is-active"><span class="section-number">6.4. </span>Multi-Channel Recall</a>
        </nav>
        <div class="mdl-layout-spacer"></div>
        <nav class="mdl-navigation">
        
<form class="form-inline pull-sm-right" action="../search.html" method="get">
      <div class="mdl-textfield mdl-js-textfield mdl-textfield--expandable mdl-textfield--floating-label mdl-textfield--align-right">
        <label id="quick-search-icon" class="mdl-button mdl-js-button mdl-button--icon"  for="waterfall-exp">
          <i class="material-icons">search</i>
        </label>
        <div class="mdl-textfield__expandable-holder">
          <input class="mdl-textfield__input" type="text" name="q"  id="waterfall-exp" placeholder="Search" />
          <input type="hidden" name="check_keywords" value="yes" />
          <input type="hidden" name="area" value="default" />
        </div>
      </div>
      <div class="mdl-tooltip" data-mdl-for="quick-search-icon">
      Quick search
      </div>
</form>
        
<a id="button-show-source"
    class="mdl-button mdl-js-button mdl-button--icon"
    href="../_sources/chapter_5_projects/4.recall.rst.txt" rel="nofollow">
  <i class="material-icons">code</i>
</a>
<div class="mdl-tooltip" data-mdl-for="button-show-source">
Show Source
</div>
        </nav>
    </div>
    <div class="mdl-layout__header-row header-links">
      <div class="mdl-layout-spacer"></div>
      <nav class="mdl-navigation">
          
              <a  class="mdl-navigation__link" href="https://funrec-notebooks.s3.eu-west-3.amazonaws.com/fun-rec.zip">
                  <i class="fas fa-download"></i>
                  Jupyter Notebooks
              </a>
          
              <a  class="mdl-navigation__link" href="https://github.com/datawhalechina/fun-rec">
                  <i class="fab fa-github"></i>
                  GitHub
              </a>
      </nav>
    </div>
</header><header class="mdl-layout__drawer">
    
          <!-- Title -->
      <span class="mdl-layout-title">
          <a class="title" href="../index.html">
              <span class="title-text">
                  FunRec Recommender Systems
              </span>
          </a>
      </span>
    
    
      <div class="globaltoc">
        <span class="mdl-layout-title toc">Table Of Contents</span>
        
        
            
            <nav class="mdl-navigation">
                <ul>
<li class="toctree-l1"><a class="reference internal" href="../chapter_preface/index.html">Preface</a></li>
<li class="toctree-l1"><a class="reference internal" href="../chapter_installation/index.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../chapter_notation/index.html">Notation</a></li>
</ul>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../chapter_0_introduction/index.html">1. Overview of Recommender Systems</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../chapter_0_introduction/1.intro.html">1.1. What Is a Recommender System?</a></li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_0_introduction/2.outline.html">1.2. Book Overview</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../chapter_1_retrieval/index.html">2. Recall Models</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../chapter_1_retrieval/1.cf/index.html">2.1. Collaborative Filtering</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/1.cf/1.itemcf.html">2.1.1. Item-Based Collaborative Filtering</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/1.cf/2.usercf.html">2.1.2. User-Based Collaborative Filtering</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/1.cf/3.mf.html">2.1.3. Matrix Factorization</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/1.cf/4.summary.html">2.1.4. Summary</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_1_retrieval/2.embedding/index.html">2.2. Embedding-Based Recall</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/2.embedding/1.i2i.html">2.2.1. I2I Recall</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/2.embedding/2.u2i.html">2.2.2. U2I Recall</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/2.embedding/3.summary.html">2.2.3. Summary</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_1_retrieval/3.sequence/index.html">2.3. Sequential Recall</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/3.sequence/1.user_interests.html">2.3.1. Deepening User Interest Representations</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/3.sequence/2.generateive_recall.html">2.3.2. Generative Recall Methods</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_1_retrieval/3.sequence/3.summary.html">2.3.3. Summary</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../chapter_2_ranking/index.html">3. Fine-Ranking Models</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../chapter_2_ranking/1.wide_and_deep.html">3.1. Memorization and Generalization</a></li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_2_ranking/2.feature_crossing/index.html">3.2. Feature Crossing</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../chapter_2_ranking/2.feature_crossing/1.second_order.html">3.2.1. Second-Order Feature Crossing</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_2_ranking/2.feature_crossing/2.higher_order.html">3.2.2. Higher-Order Feature Crossing</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_2_ranking/3.sequence.html">3.3. Sequence Modeling</a></li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_2_ranking/4.multi_objective/index.html">3.4. Multi-Objective Modeling</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../chapter_2_ranking/4.multi_objective/1.arch.html">3.4.1. Evolution of Base Architectures</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_2_ranking/4.multi_objective/2.dependency_modeling.html">3.4.2. Task Dependency Modeling</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_2_ranking/4.multi_objective/3.multi_loss_optim.html">3.4.3. Multi-Objective Loss Fusion</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_2_ranking/5.multi_scenario/index.html">3.5. Multi-Scenario Modeling</a><ul>
<li class="toctree-l3"><a class="reference internal" href="../chapter_2_ranking/5.multi_scenario/1.multi_tower.html">3.5.1. Multi-Tower Architectures</a></li>
<li class="toctree-l3"><a class="reference internal" href="../chapter_2_ranking/5.multi_scenario/2.dynamic_weight.html">3.5.2. Dynamic Weight Modeling</a></li>
</ul>
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../chapter_3_rerank/index.html">4. Re-Ranking Models</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../chapter_3_rerank/1.greedy.html">4.1. Greedy Re-Ranking</a></li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_3_rerank/2.personalized.html">4.2. Personalized Re-Ranking</a></li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_3_rerank/3.summary.html">4.3. Chapter Summary</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../chapter_4_trends/index.html">5. Challenges and Hot Research Topics</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../chapter_4_trends/1.debias.html">5.1. Model Debiasing</a></li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_4_trends/2.cold_start.html">5.2. The Cold-Start Problem</a></li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_4_trends/3.generative.html">5.3. Generative Recommendation</a></li>
<li class="toctree-l2"><a class="reference internal" href="../chapter_4_trends/4.summary.html">5.4. Chapter Summary</a></li>
</ul>
</li>
<li class="toctree-l1 current"><a class="reference internal" href="index.html">6. Project Practice</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="1.understanding.html">6.1. Understanding the Competition Task</a></li>
<li class="toctree-l2"><a class="reference internal" href="2.baseline.html">6.2. Baseline</a></li>
<li class="toctree-l2"><a class="reference internal" href="3.analysis.html">6.3. Data Analysis</a></li>
<li class="toctree-l2 current"><a class="current reference internal" href="#">6.4. Multi-Channel Recall</a></li>
<li class="toctree-l2"><a class="reference internal" href="5.feature_engineering.html">6.5. Feature Engineering</a></li>
<li class="toctree-l2"><a class="reference internal" href="6.ranking.html">6.6. Ranking Model</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../chapter_appendix/index.html">7. Appendix</a><ul>
<li class="toctree-l2"><a class="reference internal" href="../chapter_appendix/word2vec.html">7.1. Word2vec</a></li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../chapter_references/references.html">References</a></li>
</ul>

            </nav>
        
        </div>
    
</header>
        <main class="mdl-layout__content" tabIndex="0">

	<script type="text/javascript" src="../_static/sphinx_materialdesign_theme.js "></script>

    <div class="document">
        <div class="page-content" role="main">
        
  <section id="id1">
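<h1><span class="section-number">6.4. </span>Multi-Channel Recall<a class="headerlink" href="#id1" title="Permalink to this heading">¶</a></h1>
<p>A "multi-channel recall" strategy uses several different strategies, features, or simple models, each of which recalls part of the candidate set; the candidates are then merged and passed to the downstream ranking model. Multi-channel recall is a trade-off between computation speed and recall rate: each channel is simple enough to recall candidates quickly, while designing the channels from different angles keeps the overall recall rate close to ideal, so ranking quality is not hurt. The figure below illustrates multi-channel recall. Because the channels are independent of one another, they can usually be run concurrently (for example, in multiple threads), which makes recall more efficient.</p>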
<figure class="align-default" id="id16">
<span id="multi-channel-recall"></span><a class="reference internal image-reference" href="../_images/3_multi_channel_recall.png"><img alt="../_images/3_multi_channel_recall.png" src="../_images/3_multi_channel_recall.png" style="width: 450px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 6.4.1 </span><span class="caption-text">Multi-channel recall</span><a class="headerlink" href="#id16" title="Permalink to this image">¶</a></p>
</figcaption>
</figure>
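<p>The figure above is only one example of multi-channel recall: several different strategies can be used to build the candidate item set that is later ranked for a user. Which recall strategies to use depends heavily on the business, and each task has recall rules that reflect its real-world scenario. In video recommendation, for example, the recall rules might include "popular videos", "director recall", "actor recall", "recently released", "trending", and "genre recall".</p>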
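<p>The merge-and-run-concurrently idea can be sketched as follows. This is a minimal illustration rather than the implementation used in this chapter: the two channel functions (<code>hot_recall</code>, <code>recent_recall</code>), their toy scores, and the "keep the highest score" merge rule are all hypothetical assumptions made for the example.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy channels: each returns {item_id: score} for one user.
def hot_recall(user_id, topk=3):
    hot_items = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
    return dict(sorted(hot_items.items(), key=lambda kv: -kv[1])[:topk])

def recent_recall(user_id, topk=3):
    recent_items = {"c": 0.95, "e": 0.5, "f": 0.4}
    return dict(sorted(recent_items.items(), key=lambda kv: -kv[1])[:topk])

def multi_channel_recall(user_id, channels, topk=5):
    """Run every channel concurrently and merge their candidates."""
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        futures = [pool.submit(fn, user_id) for fn in channels]
        results = [f.result() for f in futures]
    merged = {}
    for cand in results:
        for item, score in cand.items():
            # Assumption: when channels overlap, keep the best score.
            merged[item] = max(score, merged.get(item, 0.0))
    # Sort by score (descending), breaking ties by item id.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:topk]

candidates = multi_channel_recall(user_id=1, channels=[hot_recall, recent_recall])
```

<p>In practice, scores from different channels are usually not directly comparable, so a real system would typically normalize or weight them per channel before merging.</p>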
<p><strong>Imports</strong></p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">os</span><span class="o">,</span><span class="w"> </span><span class="nn">math</span><span class="o">,</span><span class="w"> </span><span class="nn">warnings</span><span class="o">,</span><span class="w"> </span><span class="nn">pickle</span><span class="o">,</span><span class="w"> </span><span class="nn">random</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">pathlib</span><span class="w"> </span><span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">datetime</span><span class="w"> </span><span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">collections</span><span class="w"> </span><span class="kn">import</span> <span class="n">defaultdict</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">logging</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;OMP_NUM_THREADS&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;1&#39;</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">INFO</span><span class="p">)</span>

<span class="kn">import</span><span class="w"> </span><span class="nn">faiss</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">pandas</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">pd</span>
<span class="kn">import</span><span class="w"> </span><span class="nn">numpy</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">np</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tqdm</span><span class="w"> </span><span class="kn">import</span> <span class="n">tqdm</span>


<span class="kn">from</span><span class="w"> </span><span class="nn">sklearn.preprocessing</span><span class="w"> </span><span class="kn">import</span> <span class="n">MinMaxScaler</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">sklearn.preprocessing</span><span class="w"> </span><span class="kn">import</span> <span class="n">LabelEncoder</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tensorflow.keras</span><span class="w"> </span><span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tensorflow.keras.models</span><span class="w"> </span><span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tensorflow.keras.preprocessing.sequence</span><span class="w"> </span><span class="kn">import</span> <span class="n">pad_sequences</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Data paths</span>
<span class="n">base_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">&#39;../tmp/projects/news_recommendation&#39;</span><span class="p">)</span>
<span class="n">data_path</span> <span class="o">=</span> <span class="n">base_path</span> <span class="o">/</span> <span class="s1">&#39;data_raw&#39;</span>
<span class="n">save_path</span> <span class="o">=</span> <span class="n">base_path</span> <span class="o">/</span> <span class="s1">&#39;temp_results&#39;</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">save_path</span><span class="o">.</span><span class="n">exists</span><span class="p">():</span>
    <span class="n">save_path</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># Flag for recall evaluation; when False, recall runs directly on the full dataset without evaluation</span>
<span class="n">metric_recall</span> <span class="o">=</span> <span class="kc">False</span>
</pre></div>
</div>
<section id="id2">
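<h2><span class="section-number">6.4.1. </span>Loading the Data<a class="headerlink" href="#id2" title="Permalink to this heading">¶</a></h2>
<p>In recommender system competitions, data loading usually falls into three modes, each corresponding to a different dataset:</p>
<ol class="arabic simple">
<li><p>Debug mode: the goal is to build a simple baseline on the data and get it running end to end, so that we know the baseline code works. Competition datasets are often very large, and analyzing the full data and building the baseline framework on it from the start costs time and hardware resources. <strong>So at this stage we usually draw a random sample from the large training set for debugging (train_click_log_sample)</strong> and get a baseline running first.</p></li>
<li><p>Offline validation mode: the goal is to choose a suitable model and hyperparameters offline, using the existing training data. <strong>So here we only need to load the whole training set (train_click_log)</strong> and then split it into a training part and a validation part. The training part is used to fit the model; the validation part is used to tune the model's parameters and other hyperparameters.</p></li>
<li><p>Online mode: having built a baseline in debug mode and selected the model and hyperparameters in offline validation mode, we now make the actual predictions for the given test set and submit them online. <strong>So the training data here is the full dataset (train_click_log + test_click_log).</strong></p></li>
</ol>
<p>Below we define data-loading functions that cover these three modes, so that the data can be loaded conveniently in whichever mode is needed.</p>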
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Debug mode: sample part of the training set to debug the code</span>
<span class="k">def</span><span class="w"> </span><span class="nf">get_all_click_sample</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">sample_nums</span><span class="o">=</span><span class="mi">10000</span><span class="p">):</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
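<span class="sd">        Sample part of the training set for debugging</span>
<span class="sd">        data_path: path where the raw data is stored</span>
<span class="sd">        sample_nums: number of users to sample (sampling by user keeps memory usage manageable)</span>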
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">all_click</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span> <span class="o">/</span> <span class="s1">&#39;train_click_log.csv&#39;</span><span class="p">)</span>
    <span class="n">all_user_ids</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">user_id</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>

    <span class="n">sample_user_ids</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">all_user_ids</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">sample_nums</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
    <span class="n">all_click</span> <span class="o">=</span> <span class="n">all_click</span><span class="p">[</span><span class="n">all_click</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">sample_user_ids</span><span class="p">)]</span>

    <span class="n">all_click</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(([</span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span> <span class="s1">&#39;click_article_id&#39;</span><span class="p">,</span> <span class="s1">&#39;click_timestamp&#39;</span><span class="p">]))</span>
    <span class="k">return</span> <span class="n">all_click</span>

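<span class="c1"># Load the click data for online or offline use: to produce an online submission, the test-set clicks should be merged into the full data</span>
<span class="c1"># To validate a model or a feature offline, the training set alone is sufficient</span>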
<span class="k">def</span><span class="w"> </span><span class="nf">get_all_click_df</span><span class="p">(</span><span class="n">data_path</span><span class="p">,</span> <span class="n">offline</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">offline</span><span class="p">:</span>
        <span class="n">all_click</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span> <span class="o">/</span> <span class="s1">&#39;train_click_log.csv&#39;</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">trn_click</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span> <span class="o">/</span> <span class="s1">&#39;train_click_log.csv&#39;</span><span class="p">)</span>
        <span class="n">tst_click</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span> <span class="o">/</span> <span class="s1">&#39;testA_click_log.csv&#39;</span><span class="p">)</span>

        <span class="c1"># DataFrame.append was removed in pandas 2.0, so use pd.concat instead</span>
        <span class="n">all_click</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">trn_click</span><span class="p">,</span> <span class="n">tst_click</span><span class="p">])</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

    <span class="n">all_click</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(([</span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span> <span class="s1">&#39;click_article_id&#39;</span><span class="p">,</span> <span class="s1">&#39;click_timestamp&#39;</span><span class="p">]))</span>
    <span class="k">return</span> <span class="n">all_click</span>
</pre></div>
</div>
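<p>For the offline validation mode, one common way to split the loaded training log is to hold out each user's most recent click as the validation label. The sketch below is an assumption about how such a split could look, not code from this chapter: the helper name <code>split_last_click</code> and the toy values are hypothetical, and only the column names are taken from the functions above.</p>

```python
import pandas as pd

def split_last_click(all_click_df):
    """Hold out each user's most recent click as the validation label."""
    df = all_click_df.sort_values(["user_id", "click_timestamp"])
    df = df.reset_index(drop=True)
    # After sorting by time, the last row per user is the held-out click.
    last_idx = df.groupby("user_id").tail(1).index
    val_df = df.loc[last_idx].reset_index(drop=True)
    trn_df = df.drop(index=last_idx).reset_index(drop=True)
    return trn_df, val_df

# Toy click log with hypothetical values.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "click_article_id": [10, 11, 12, 20, 21],
    "click_timestamp": [100, 300, 200, 50, 60],
})
trn_df, val_df = split_last_click(toy)
```

<p>Note that with this split a user who has only one click is left with no training history, so such users may need separate handling.</p>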
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Load the basic article attributes</span>
<span class="k">def</span><span class="w"> </span><span class="nf">get_item_info_df</span><span class="p">(</span><span class="n">data_path</span><span class="p">):</span>
    <span class="n">item_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span> <span class="o">/</span> <span class="s1">&#39;articles.csv&#39;</span><span class="p">)</span>

    <span class="c1"># Rename article_id to click_article_id so it can be joined with click_article_id in the training set</span>
    <span class="n">item_info_df</span> <span class="o">=</span> <span class="n">item_info_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;article_id&#39;</span><span class="p">:</span> <span class="s1">&#39;click_article_id&#39;</span><span class="p">})</span>

    <span class="k">return</span> <span class="n">item_info_df</span>
</pre></div>
</div>
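<p>The reason for the rename is that the click log and the article table can then be joined on a shared key. A minimal sketch of that join, using toy frames with hypothetical values (the <code>category_id</code> column is an assumed attribute used only for illustration):</p>

```python
import pandas as pd

# Toy frames with hypothetical values; only the key columns come from the code above.
click_df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "click_article_id": [10, 11, 10],
})
item_info_df = pd.DataFrame({
    "article_id": [10, 11],
    "category_id": [3, 7],  # hypothetical attribute column
}).rename(columns={"article_id": "click_article_id"})

# A left join keeps every click, even if an article has no attribute row.
click_with_info = click_df.merge(item_info_df, on="click_article_id", how="left")
```

<p>A left join is used here so that no click is dropped when an article is missing from the attribute table; its attribute columns would simply be empty.</p>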
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Load the article embedding data</span>
<span class="k">def</span><span class="w"> </span><span class="nf">get_item_emb_dict</span><span class="p">(</span><span class="n">data_path</span><span class="p">):</span>
    <span class="n">item_emb_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span> <span class="o">/</span> <span class="s1">&#39;articles_emb.csv&#39;</span><span class="p">)</span>

    <span class="n">item_emb_cols</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">item_emb_df</span><span class="o">.</span><span class="n">columns</span> <span class="k">if</span> <span class="s1">&#39;emb&#39;</span> <span class="ow">in</span> <span class="n">x</span><span class="p">]</span>
    <span class="n">item_emb_np</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ascontiguousarray</span><span class="p">(</span><span class="n">item_emb_df</span><span class="p">[</span><span class="n">item_emb_cols</span><span class="p">])</span>
    <span class="c1"># L2-normalize each embedding row</span>
    <span class="n">item_emb_np</span> <span class="o">=</span> <span class="n">item_emb_np</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">item_emb_np</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

    <span class="n">item_emb_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">item_emb_df</span><span class="p">[</span><span class="s1">&#39;article_id&#39;</span><span class="p">],</span> <span class="n">item_emb_np</span><span class="p">))</span>
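    <span class="c1"># Cache the dictionary; save_path is not a parameter here and is expected to be defined elsewhere</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;item_content_emb.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">item_emb_dict</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>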

    <span class="k">return</span> <span class="n">item_emb_dict</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">max_min_scaler</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="o">/</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</pre></div>
</div>
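<p>As a minimal sketch of what <code class="docutils literal notranslate"><span class="pre">max_min_scaler</span></code> does, the toy example below rescales a hypothetical timestamp column to the range [0, 1]. The sample values are assumptions for illustration and are not taken from the dataset.</p>

```python
import numpy as np
import pandas as pd

# Same min-max formula as above: (x - min) / (max - min)
max_min_scaler = lambda x: (x - np.min(x)) / (np.max(x) - np.min(x))

# Hypothetical click timestamps (illustrative values only)
toy_df = pd.DataFrame({'click_timestamp': [100, 150, 200]})
toy_df['click_timestamp'] = toy_df[['click_timestamp']].apply(max_min_scaler)

scaled = toy_df['click_timestamp'].tolist()
print(scaled)  # [0.0, 0.5, 1.0]
```

<p>Note that a column whose values are all identical makes the denominator zero, so the result would be NaN; this sketch does not guard against that case.</p>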
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Sampled data</span>
<span class="n">all_click_df</span> <span class="o">=</span> <span class="n">get_all_click_sample</span><span class="p">(</span><span class="n">data_path</span><span class="p">)</span>

<span class="c1"># Full training set</span>
<span class="c1"># all_click_df = get_all_click_df(data_path, offline=False)</span>

<span class="c1"># Min-max normalize the timestamps; they are used to compute weights in the association rules</span>
<span class="n">all_click_df</span><span class="p">[</span><span class="s1">&#39;click_timestamp&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">all_click_df</span><span class="p">[[</span><span class="s1">&#39;click_timestamp&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">max_min_scaler</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">item_info_df</span> <span class="o">=</span> <span class="n">get_item_info_df</span><span class="p">(</span><span class="n">data_path</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">item_emb_dict</span> <span class="o">=</span> <span class="n">get_item_emb_dict</span><span class="p">(</span><span class="n">data_path</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="id3">
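<h2><span class="section-number">6.4.2. </span>Utility functions<a class="headerlink" href="#id3" title="Permalink to this heading">¶</a></h2>
<section id="id4">
<h3><span class="section-number">6.4.2.1. </span>Get the user-article-time mapping<a class="headerlink" href="#id4" title="Permalink to this heading">¶</a></h3>
<p>This mapping is used in association-rule-based collaborative filtering: it gives each user's time-ordered click sequence, which <code class="docutils literal notranslate"><span class="pre">itemcf_sim</span></code> iterates over.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Build each user's click sequence ordered by click time: {user1: [(item1, time1), (item2, time2), ...], ...}</span>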
<span class="k">def</span><span class="w"> </span><span class="nf">get_user_item_time</span><span class="p">(</span><span class="n">click_df</span><span class="p">):</span>

    <span class="n">click_df</span> <span class="o">=</span> <span class="n">click_df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;click_timestamp&#39;</span><span class="p">)</span>

    <span class="k">def</span><span class="w"> </span><span class="nf">make_item_time_pair</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;click_timestamp&#39;</span><span class="p">]))</span>

    <span class="n">user_item_time_df</span> <span class="o">=</span> <span class="n">click_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)[[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">,</span> <span class="s1">&#39;click_timestamp&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">make_item_time_pair</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>\
                                                            <span class="o">.</span><span class="n">reset_index</span><span class="p">()</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s1">&#39;item_time_list&#39;</span><span class="p">})</span>
    <span class="n">user_item_time_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">user_item_time_df</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">user_item_time_df</span><span class="p">[</span><span class="s1">&#39;item_time_list&#39;</span><span class="p">]))</span>

    <span class="k">return</span> <span class="n">user_item_time_dict</span>
</pre></div>
</div>
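<p>To make the returned structure concrete, here is a small sketch on a hypothetical click log. It uses a dict comprehension over <code class="docutils literal notranslate"><span class="pre">groupby</span></code> rather than the <code class="docutils literal notranslate"><span class="pre">apply</span></code> call above, which is a simplification assumed to be equivalent for this purpose; the data values are made up.</p>

```python
import pandas as pd

# Hypothetical click log (values are illustrative only)
toy_clicks = pd.DataFrame({
    'user_id': [1, 1, 2],
    'click_article_id': [10, 11, 10],
    'click_timestamp': [0.2, 0.1, 0.5],
})

# Sort by time first, then collect (article, time) pairs per user
toy_clicks = toy_clicks.sort_values('click_timestamp')
user_item_time_dict = {
    user: list(zip(group['click_article_id'], group['click_timestamp']))
    for user, group in toy_clicks.groupby('user_id')
}

print(user_item_time_dict)
```

<p>Each value is a list of <code class="docutils literal notranslate"><span class="pre">(article_id,</span> <span class="pre">timestamp)</span></code> tuples in click order, so user 1's earlier click on article 11 comes before the click on article 10.</p>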
</section>
<section id="id5">
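<h3><span class="section-number">6.4.2.2. </span>Get the article-user-time mapping<a class="headerlink" href="#id5" title="Permalink to this heading">¶</a></h3>
<p>This mapping is used in association-rule-based collaborative filtering that needs, for each article, the users who clicked it.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Build, for each article, the sequence of users who clicked it, ordered by time: {item1: [(user1, time1), (user2, time2), ...], ...}</span>
<span class="c1"># The time here is when the user clicked this article; it does not appear to matter directly here.</span>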
<span class="k">def</span><span class="w"> </span><span class="nf">get_item_user_time_dict</span><span class="p">(</span><span class="n">click_df</span><span class="p">):</span>
    <span class="k">def</span><span class="w"> </span><span class="nf">make_user_time_pair</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s1">&#39;click_timestamp&#39;</span><span class="p">]))</span>

    <span class="n">click_df</span> <span class="o">=</span> <span class="n">click_df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;click_timestamp&#39;</span><span class="p">)</span>
    <span class="n">item_user_time_df</span> <span class="o">=</span> <span class="n">click_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">)[[</span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span> <span class="s1">&#39;click_timestamp&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">make_user_time_pair</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>\
                                                            <span class="o">.</span><span class="n">reset_index</span><span class="p">()</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="s1">&#39;user_time_list&#39;</span><span class="p">})</span>

    <span class="n">item_user_time_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">item_user_time_df</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">item_user_time_df</span><span class="p">[</span><span class="s1">&#39;user_time_list&#39;</span><span class="p">]))</span>
    <span class="k">return</span> <span class="n">item_user_time_dict</span>
</pre></div>
</div>
</section>
<section id="id6">
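<h3><span class="section-number">6.4.2.3. </span>Get the history and the last click<a class="headerlink" href="#id6" title="Permalink to this heading">¶</a></h3>
<p>This is used when evaluating recall results, in feature engineering, and when building labels to turn the data into a supervised-learning test set.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Split the current data into each user&#39;s click history and last click</span>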
<span class="k">def</span><span class="w"> </span><span class="nf">get_hist_and_last_click</span><span class="p">(</span><span class="n">all_click</span><span class="p">):</span>

    <span class="n">all_click</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span> <span class="s1">&#39;click_timestamp&#39;</span><span class="p">])</span>
    <span class="n">click_last_df</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">tail</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

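    <span class="c1"># If a user has only one click, the history would be empty and the user would be invisible during training, so in that case the single click is deliberately kept in the history (a known leak)</span>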
    <span class="k">def</span><span class="w"> </span><span class="nf">hist_func</span><span class="p">(</span><span class="n">user_df</span><span class="p">):</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">user_df</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">user_df</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">user_df</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>

    <span class="n">click_hist_df</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">hist_func</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">click_hist_df</span><span class="p">,</span> <span class="n">click_last_df</span>
</pre></div>
</div>
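<p>The split can be illustrated on a hypothetical four-row click log. This sketch uses a boolean mask instead of the per-group <code class="docutils literal notranslate"><span class="pre">apply</span></code> above; that vectorized form is an assumption made for brevity, and the data values are made up.</p>

```python
import pandas as pd

# Hypothetical click log: user 1 has three clicks, user 2 has one
toy_clicks = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'click_article_id': [10, 11, 12, 20],
    'click_timestamp': [1, 2, 3, 5],
})
toy_clicks = toy_clicks.sort_values(by=['user_id', 'click_timestamp'])

# Last click per user: the held-out target
last_df = toy_clicks.groupby('user_id').tail(1)

# History: everything except the last click, unless it is the user's only click
group_size = toy_clicks.groupby('user_id')['user_id'].transform('count')
is_last = pd.Series(toy_clicks.index.isin(last_df.index), index=toy_clicks.index)
hist_df = toy_clicks[~is_last | (group_size == 1)]

print(hist_df['click_article_id'].tolist())  # [10, 11, 20]
print(last_df['click_article_id'].tolist())  # [12, 20]
```

<p>User 2's only click (article 20) appears in both the history and the last-click frame, which is the deliberate leak described in the comment above.</p>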
</section>
<section id="id7">
<h3><span class="section-number">6.4.2.4. </span>Get article attribute features<a class="headerlink" href="#id7" title="Permalink to this heading">¶</a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Map each article id to its basic attributes and store them as dictionaries, so the recall and cold-start stages can look them up directly</span>
<span class="k">def</span><span class="w"> </span><span class="nf">get_item_info_dict</span><span class="p">(</span><span class="n">item_info_df</span><span class="p">):</span>
    <span class="n">max_min_scaler</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="o">/</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
    <span class="n">item_info_df</span><span class="p">[</span><span class="s1">&#39;created_at_ts&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">item_info_df</span><span class="p">[[</span><span class="s1">&#39;created_at_ts&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">max_min_scaler</span><span class="p">)</span>

    <span class="n">item_type_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">item_info_df</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">item_info_df</span><span class="p">[</span><span class="s1">&#39;category_id&#39;</span><span class="p">]))</span>
    <span class="n">item_words_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">item_info_df</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">item_info_df</span><span class="p">[</span><span class="s1">&#39;words_count&#39;</span><span class="p">]))</span>
    <span class="n">item_created_time_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">item_info_df</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">item_info_df</span><span class="p">[</span><span class="s1">&#39;created_at_ts&#39;</span><span class="p">]))</span>

    <span class="k">return</span> <span class="n">item_type_dict</span><span class="p">,</span> <span class="n">item_words_dict</span><span class="p">,</span> <span class="n">item_created_time_dict</span>
</pre></div>
</div>
</section>
<section id="id8">
<h3><span class="section-number">6.4.2.5. </span>Get information about the articles in each user&#39;s click history<a class="headerlink" href="#id8" title="Permalink to this heading">¶</a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">get_user_hist_item_info_dict</span><span class="p">(</span><span class="n">all_click</span><span class="p">):</span>

    <span class="c1"># Map each user_id to the set of categories of the articles the user clicked</span>
    <span class="n">user_hist_item_typs</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)[</span><span class="s1">&#39;category_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">user_hist_item_typs_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">user_hist_item_typs</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">user_hist_item_typs</span><span class="p">[</span><span class="s1">&#39;category_id&#39;</span><span class="p">]))</span>

    <span class="c1"># Map each user_id to the set of articles the user clicked</span>
    <span class="n">user_hist_item_ids_dict</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="nb">set</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">user_hist_item_ids_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">user_hist_item_ids_dict</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">user_hist_item_ids_dict</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]))</span>

    <span class="c1"># Map each user_id to the average word count of the articles the user clicked</span>
    <span class="n">user_hist_item_words</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)[</span><span class="s1">&#39;words_count&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="s1">&#39;mean&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">user_hist_item_words_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">user_hist_item_words</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">user_hist_item_words</span><span class="p">[</span><span class="s1">&#39;words_count&#39;</span><span class="p">]))</span>

    <span class="c1"># Map each user_id to the creation time of the last article the user clicked</span>
    <span class="n">all_click_</span> <span class="o">=</span> <span class="n">all_click</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">&#39;click_timestamp&#39;</span><span class="p">)</span>
    <span class="n">user_last_item_created_time</span> <span class="o">=</span> <span class="n">all_click_</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)[</span><span class="s1">&#39;created_at_ts&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>

    <span class="n">max_min_scaler</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="o">/</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
    <span class="n">user_last_item_created_time</span><span class="p">[</span><span class="s1">&#39;created_at_ts&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_last_item_created_time</span><span class="p">[[</span><span class="s1">&#39;created_at_ts&#39;</span><span class="p">]]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">max_min_scaler</span><span class="p">)</span>

    <span class="n">user_last_item_created_time_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">user_last_item_created_time</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> \
                                                <span class="n">user_last_item_created_time</span><span class="p">[</span><span class="s1">&#39;created_at_ts&#39;</span><span class="p">]))</span>

    <span class="k">return</span> <span class="n">user_hist_item_typs_dict</span><span class="p">,</span> <span class="n">user_hist_item_ids_dict</span><span class="p">,</span> <span class="n">user_hist_item_words_dict</span><span class="p">,</span> <span class="n">user_last_item_created_time_dict</span>
</pre></div>
</div>
</section>
<section id="top-k">
<h3><span class="section-number">6.4.2.6. </span>Get the Top-k most-clicked articles<a class="headerlink" href="#top-k" title="Permalink to this heading">¶</a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Get the k most-clicked articles in click_df (pass a recent slice of the log to get recently popular articles)</span>
<span class="k">def</span><span class="w"> </span><span class="nf">get_item_topk_click</span><span class="p">(</span><span class="n">click_df</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
    <span class="n">topk_click</span> <span class="o">=</span> <span class="n">click_df</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">index</span><span class="p">[:</span><span class="n">k</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">topk_click</span>
</pre></div>
</div>
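<p>A quick sketch of the same <code class="docutils literal notranslate"><span class="pre">value_counts</span></code> idiom on a hypothetical click column (made-up values) shows that the result is an index of article ids ordered from most to least clicked.</p>

```python
import pandas as pd

# Hypothetical clicks: article 7 three times, article 3 twice, article 9 once
toy_clicks = pd.DataFrame({'click_article_id': [7, 7, 7, 3, 3, 9]})

# value_counts sorts by frequency in descending order; the index holds the article ids
top2 = list(toy_clicks['click_article_id'].value_counts().index[:2])

print(top2)  # [7, 3]
```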
</section>
<section id="id9">
<h3><span class="section-number">6.4.2.7. </span>Define the multi-channel recall dictionary<a class="headerlink" href="#id9" title="Permalink to this heading">¶</a></h3>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Get the article attributes and store them as dictionaries for easy lookup</span>
<span class="n">item_type_dict</span><span class="p">,</span> <span class="n">item_words_dict</span><span class="p">,</span> <span class="n">item_created_time_dict</span> <span class="o">=</span> <span class="n">get_item_info_dict</span><span class="p">(</span><span class="n">item_info_df</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Define a multi-channel recall dictionary that stores the results of every recall channel</span>
<span class="n">user_multi_recall_dict</span> <span class="o">=</span>  <span class="p">{</span><span class="s1">&#39;itemcf_sim_itemcf_recall&#39;</span><span class="p">:</span> <span class="p">{},</span>
                           <span class="s1">&#39;embedding_sim_item_recall&#39;</span><span class="p">:</span> <span class="p">{},</span>
                           <span class="s1">&#39;youtubednn_recall&#39;</span><span class="p">:</span> <span class="p">{},</span>
                           <span class="s1">&#39;youtubednn_usercf_recall&#39;</span><span class="p">:</span> <span class="p">{},</span>
                           <span class="s1">&#39;cold_start_recall&#39;</span><span class="p">:</span> <span class="p">{}}</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Hold out each user&#39;s last click to evaluate recall (offline model validation)</span>
<span class="c1"># If recall is not being evaluated, run recall on the full training set without holding out the last click</span>
<span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">trn_last_click_df</span> <span class="o">=</span> <span class="n">get_hist_and_last_click</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="id10">
<h3><span class="section-number">6.4.2.8. </span>Evaluating recall quality<a class="headerlink" href="#id10" title="Permalink to this heading">¶</a></h3>
<p>After recall, the recall methods or their parameters sometimes need tuning to improve recall quality, because the recall results set the upper bound for the final ranking. A method for evaluating recall is provided below.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Evaluate the hit rate within the top 10, 20, 30, 40, 50 recalled articles in turn (the loop steps by 10, so topk must be at least 10)</span>
<span class="k">def</span><span class="w"> </span><span class="nf">metrics_recall</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="p">,</span> <span class="n">trn_last_click_df</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">50</span><span class="p">):</span>
    <span class="n">last_click_item_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">trn_last_click_df</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">trn_last_click_df</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]))</span>
    <span class="n">user_num</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">topk</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">):</span>
        <span class="n">hit_num</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">for</span> <span class="n">user</span><span class="p">,</span> <span class="n">item_list</span> <span class="ow">in</span> <span class="n">user_recall_items_dict</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
            <span class="c1"># Take the top-k recalled items</span>
            <span class="n">tmp_recall_items</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">user_recall_items_dict</span><span class="p">[</span><span class="n">user</span><span class="p">][:</span><span class="n">k</span><span class="p">]]</span>
            <span class="k">if</span> <span class="n">last_click_item_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span> <span class="ow">in</span> <span class="nb">set</span><span class="p">(</span><span class="n">tmp_recall_items</span><span class="p">):</span>
                <span class="n">hit_num</span> <span class="o">+=</span> <span class="mi">1</span>

        <span class="n">hit_rate</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="n">hit_num</span> <span class="o">*</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">user_num</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
        <span class="nb">print</span><span class="p">(</span><span class="s1">&#39; topk: &#39;</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="s1">&#39; : &#39;</span><span class="p">,</span> <span class="s1">&#39;hit_num: &#39;</span><span class="p">,</span> <span class="n">hit_num</span><span class="p">,</span> <span class="s1">&#39;hit_rate: &#39;</span><span class="p">,</span> <span class="n">hit_rate</span><span class="p">,</span> <span class="s1">&#39;user_num : &#39;</span><span class="p">,</span> <span class="n">user_num</span><span class="p">)</span>
</pre></div>
</div>
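<p>The hit-rate computation can be checked by hand on a tiny example. The helper name <code class="docutils literal notranslate"><span class="pre">hit_rate_at_k</span></code> and the toy recall lists are hypothetical; the sketch assumes each recall list holds <code class="docutils literal notranslate"><span class="pre">(item,</span> <span class="pre">score)</span></code> pairs sorted by score, as in the function above.</p>

```python
def hit_rate_at_k(user_recall_items_dict, last_click_item_dict, k):
    """Fraction of users whose held-out last click is among their top-k recalled items."""
    hits = 0
    for user, item_list in user_recall_items_dict.items():
        top_items = {item for item, _score in item_list[:k]}
        if last_click_item_dict.get(user) in top_items:
            hits += 1
    return hits / len(user_recall_items_dict)

# Hypothetical recall results and held-out last clicks
recall = {1: [(10, 0.9), (11, 0.5)], 2: [(20, 0.8), (21, 0.3)]}
last_click = {1: 11, 2: 30}

rate_at_1 = hit_rate_at_k(recall, last_click, 1)
rate_at_2 = hit_rate_at_k(recall, last_click, 2)
print(rate_at_1, rate_at_2)  # 0.0 0.5
```

<p>With k = 1 neither user's last click is recalled; with k = 2 user 1's last click (article 11) is found, so one of the two users is a hit.</p>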
</section>
</section>
<section id="id11">
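<h2><span class="section-number">6.4.3. </span>Computing similarity matrices<a class="headerlink" href="#id11" title="Permalink to this heading">¶</a></h2>
<p>This part obtains similarity matrices mainly through collaborative filtering and vector retrieval. Similarity matrices fall into two main kinds, user2user and item2item; below we obtain the itemCF-based item2item similarity matrix.</p>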
<section id="itemcf-i2i-sim">
<h3><span class="section-number">6.4.3.1. </span>itemCF i2i_sim<a class="headerlink" href="#itemcf-i2i-sim" title="Permalink to this heading">¶</a></h3>
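<p>Borrowing from the KDD 2020 debiased product recommendation work, association rules are used when computing the item2item similarity matrix, so that the article similarity also accounts for:
(1) the user&#39;s click-time weight, (2) the user&#39;s click-order weight, and (3) the article creation-time weight.</p>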
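<p>The three weights listed above can be sketched as follows. The function name <code class="docutils literal notranslate"><span class="pre">pair_weight</span></code> and all constants (1.0, 0.7, 0.9, 0.8) are assumptions chosen for illustration; the document's own <code class="docutils literal notranslate"><span class="pre">itemcf_sim</span></code> may combine them differently.</p>

```python
import math

def pair_weight(loc1, loc2, click_t1, click_t2, created_t1, created_t2):
    """Hypothetical weight for one co-clicked article pair (all constants are assumed)."""
    # Click-order weight: nearby positions count more, and forward order counts more than backward
    loc_alpha = 1.0 if loc2 > loc1 else 0.7
    loc_weight = loc_alpha * (0.9 ** (abs(loc2 - loc1) - 1))
    # Click-time weight: clicks close together in (normalized) time count more
    click_time_weight = math.exp(0.7 ** abs(click_t1 - click_t2))
    # Creation-time weight: articles created close together count more
    created_time_weight = math.exp(0.8 ** abs(created_t1 - created_t2))
    return loc_weight * click_time_weight * created_time_weight

# Adjacent clicks with identical click and creation times
forward = pair_weight(0, 1, 0.5, 0.5, 0.3, 0.3)
backward = pair_weight(1, 0, 0.5, 0.5, 0.3, 0.3)
print(forward, backward)
```

<p>In this sketch the forward-order pair gets the larger weight, and any gap in click time or creation time shrinks the corresponding factor toward e<sup>0</sup> = 1.</p>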
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">itemcf_sim</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">item_created_time_dict</span><span class="p">):</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
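<span class="sd">        Compute the article-to-article similarity matrix</span>
<span class="sd">        :param df: click data table</span>
<span class="sd">        :param item_created_time_dict: dictionary of article creation times</span>
<span class="sd">        :return: article-to-article similarity matrix</span>

<span class="sd">        Approach: item-based collaborative filtering (for details, see the previous team-study session on recommender system basics) + association rules</span>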
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="n">user_item_time_dict</span> <span class="o">=</span> <span class="n">get_user_item_time</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>

    <span class="c1"># Compute item similarity</span>
    <span class="n">i2i_sim</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="n">item_cnt</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">user</span><span class="p">,</span> <span class="n">item_time_list</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">user_item_time_dict</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
        <span class="c1"># Time can be taken into account when refining item-based collaborative filtering</span>
        <span class="k">for</span> <span class="n">loc1</span><span class="p">,</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">i_click_time</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">item_time_list</span><span class="p">):</span>
            <span class="n">item_cnt</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="n">i2i_sim</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="p">{})</span>
            <span class="k">for</span> <span class="n">loc2</span><span class="p">,</span> <span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="n">j_click_time</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">item_time_list</span><span class="p">):</span>
                <span class="k">if</span><span class="p">(</span><span class="n">i</span> <span class="o">==</span> <span class="n">j</span><span class="p">):</span>
                    <span class="k">continue</span>

                <span class="c1"># Distinguish forward-order clicks (j after i) from reverse-order clicks</span>
                <span class="n">loc_alpha</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="k">if</span> <span class="n">loc2</span> <span class="o">&gt;</span> <span class="n">loc1</span> <span class="k">else</span> <span class="mf">0.7</span>
                <span class="c1"># Position weight; the constants are tunable</span>
                <span class="n">loc_weight</span> <span class="o">=</span> <span class="n">loc_alpha</span> <span class="o">*</span> <span class="p">(</span><span class="mf">0.9</span> <span class="o">**</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">loc2</span> <span class="o">-</span> <span class="n">loc1</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>
                <span class="c1"># Click-time weight; the constants are tunable</span>
                <span class="n">click_time_weight</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="mf">0.7</span> <span class="o">**</span> <span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">i_click_time</span> <span class="o">-</span> <span class="n">j_click_time</span><span class="p">))</span>
                <span class="c1"># Weight from the creation times of the two articles; the constants are tunable</span>
                <span class="n">created_time_weight</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="mf">0.8</span> <span class="o">**</span> <span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">item_created_time_dict</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">item_created_time_dict</span><span class="p">[</span><span class="n">j</span><span class="p">]))</span>
                <span class="n">i2i_sim</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
                <span class="c1"># Combine all the weights into the final article-to-article similarity</span>
                <span class="n">i2i_sim</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">loc_weight</span> <span class="o">*</span> <span class="n">click_time_weight</span> <span class="o">*</span> <span class="n">created_time_weight</span> <span class="o">/</span> <span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">item_time_list</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>

    <span class="n">i2i_sim_</span> <span class="o">=</span> <span class="n">i2i_sim</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">related_items</span> <span class="ow">in</span> <span class="n">i2i_sim</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">wij</span> <span class="ow">in</span> <span class="n">related_items</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">i2i_sim_</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">wij</span> <span class="o">/</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">item_cnt</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">item_cnt</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>

    <span class="c1"># Save the resulting similarity matrix to disk</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">i2i_sim_</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;itemcf_i2i_sim.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>

    <span class="k">return</span> <span class="n">i2i_sim_</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">i2i_sim</span> <span class="o">=</span> <span class="n">itemcf_sim</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">,</span> <span class="n">item_created_time_dict</span><span class="p">)</span>
</pre></div>
</div>
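<p>To make the three weights concrete, the following is a minimal sketch that isolates the contribution of a single user to one item pair. The helper name <code>pair_weight</code> and the toy timestamps are hypothetical, and the sketch assumes click and creation times have already been scaled to a small range such as [0, 1]; the constants mirror those in <code>itemcf_sim</code>.</p>

```python
import math


def pair_weight(loc1, loc2, t1, t2, c1, c2, n_clicks):
    """Weight one user adds to sim[i][j] (hypothetical helper, not part of the tutorial code).

    loc1/loc2: positions of i and j in the user's click sequence
    t1/t2: click times, c1/c2: article creation times (assumed pre-scaled)
    n_clicks: length of the user's click history
    """
    # Forward order (j clicked after i) counts fully; reverse order is discounted.
    loc_alpha = 1.0 if loc2 > loc1 else 0.7
    # Items further apart in the click sequence contribute less.
    loc_weight = loc_alpha * (0.9 ** (abs(loc2 - loc1) - 1))
    # Closer click times and closer creation times give larger weights.
    click_time_weight = math.exp(0.7 ** abs(t1 - t2))
    created_time_weight = math.exp(0.8 ** abs(c1 - c2))
    # Users with long histories are damped by the log of the history length.
    return loc_weight * click_time_weight * created_time_weight / math.log(n_clicks + 1)


# Same pair of items, seen in forward order, reverse order, and two positions apart.
forward = pair_weight(0, 1, 0.10, 0.11, 0.5, 0.5, n_clicks=3)
backward = pair_weight(1, 0, 0.11, 0.10, 0.5, 0.5, n_clicks=3)
far = pair_weight(0, 2, 0.10, 0.11, 0.5, 0.5, n_clicks=3)
print(forward, backward, far)
```

<p>With everything else equal, the reverse-order pair receives 0.7 times the forward weight, and a pair two positions apart receives 0.9 times the adjacent weight.</p>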
</section>
<section id="usercf-u2u-sim">
<h3><span class="section-number">6.4.3.2. </span>userCF u2u_sim<a class="headerlink" href="#usercf-u2u-sim" title="Permalink to this heading">¶</a></h3>
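<p>When computing similarity between users, simple association rules can also be applied, such as a user-activity weight. Here a user's click count serves as the measure of activity.</p>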
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">get_user_activate_degree_dict</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">):</span>
    <span class="n">all_click_df_</span> <span class="o">=</span> <span class="n">all_click_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>

    <span class="c1"># Min-max normalize user activity</span>
    <span class="n">mm</span> <span class="o">=</span> <span class="n">MinMaxScaler</span><span class="p">()</span>
    <span class="n">all_click_df_</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">mm</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">all_click_df_</span><span class="p">[[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]])</span>
    <span class="n">user_activate_degree_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">all_click_df_</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">all_click_df_</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]))</span>

    <span class="k">return</span> <span class="n">user_activate_degree_dict</span>
</pre></div>
</div>
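<p>The normalization step can be illustrated without pandas or scikit-learn. The sketch below is only an illustration under stated assumptions: <code>activity_degree</code> and the toy click list are hypothetical, and a constant column is mapped to 0.0, which is how min-max scaling is treated here when every user has the same count.</p>

```python
from collections import Counter


def activity_degree(click_pairs):
    """Map each user to a min-max scaled click count (hypothetical helper)."""
    # Count clicks per user from (user_id, article_id) pairs.
    counts = Counter(user for user, _ in click_pairs)
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    # Scale to [0, 1]; if all users have the same count, fall back to 0.0.
    return {u: (c - lo) / span if span else 0.0 for u, c in counts.items()}


toy_clicks = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("c", 1), ("c", 2)]
degrees = activity_degree(toy_clicks)
print(degrees)
```

<p>The least active user ends up with degree 0, so in <code>usercf_sim</code> a pair of two such users would contribute a weight of zero; this is one reason the formula there is described as open to improvement.</p>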
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">usercf_sim</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">,</span> <span class="n">user_activate_degree_dict</span><span class="p">):</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
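<span class="sd">        Compute the user-to-user similarity matrix.</span>
<span class="sd">        :param all_click_df: click log DataFrame</span>
<span class="sd">        :param user_activate_degree_dict: dict mapping user id to activity degree</span>
<span class="sd">        :return: user-to-user similarity matrix (nested dict)</span>

<span class="sd">        Approach: user-based collaborative filtering (see the previous team-study course on recommender system basics) + association rules</span>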
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">item_user_time_dict</span> <span class="o">=</span> <span class="n">get_item_user_time_dict</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>

    <span class="n">u2u_sim</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="n">user_cnt</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">item</span><span class="p">,</span> <span class="n">user_time_list</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">item_user_time_dict</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
        <span class="k">for</span> <span class="n">u</span><span class="p">,</span> <span class="n">click_time</span> <span class="ow">in</span> <span class="n">user_time_list</span><span class="p">:</span>
            <span class="n">user_cnt</span><span class="p">[</span><span class="n">u</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="n">u2u_sim</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">u</span><span class="p">,</span> <span class="p">{})</span>
            <span class="k">for</span> <span class="n">v</span><span class="p">,</span> <span class="n">click_time</span> <span class="ow">in</span> <span class="n">user_time_list</span><span class="p">:</span>
                <span class="n">u2u_sim</span><span class="p">[</span><span class="n">u</span><span class="p">]</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">u</span> <span class="o">==</span> <span class="n">v</span><span class="p">:</span>
                    <span class="k">continue</span>
                <span class="c1"># Use the mean activity of the two users as the weight; this formula can be refined</span>
                <span class="n">activate_weight</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="n">user_activate_degree_dict</span><span class="p">[</span><span class="n">u</span><span class="p">]</span> <span class="o">+</span> <span class="n">user_activate_degree_dict</span><span class="p">[</span><span class="n">v</span><span class="p">])</span>
                <span class="n">u2u_sim</span><span class="p">[</span><span class="n">u</span><span class="p">][</span><span class="n">v</span><span class="p">]</span> <span class="o">+=</span> <span class="n">activate_weight</span> <span class="o">/</span> <span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">user_time_list</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>

    <span class="n">u2u_sim_</span> <span class="o">=</span> <span class="n">u2u_sim</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">u</span><span class="p">,</span> <span class="n">related_users</span> <span class="ow">in</span> <span class="n">u2u_sim</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">v</span><span class="p">,</span> <span class="n">wij</span> <span class="ow">in</span> <span class="n">related_users</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">u2u_sim_</span><span class="p">[</span><span class="n">u</span><span class="p">][</span><span class="n">v</span><span class="p">]</span> <span class="o">=</span> <span class="n">wij</span> <span class="o">/</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">user_cnt</span><span class="p">[</span><span class="n">u</span><span class="p">]</span> <span class="o">*</span> <span class="n">user_cnt</span><span class="p">[</span><span class="n">v</span><span class="p">])</span>

    <span class="c1"># Save the resulting similarity matrix to disk</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">u2u_sim_</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;usercf_u2u_sim.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>

    <span class="k">return</span> <span class="n">u2u_sim_</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># usercf is very memory-hungry on the full data set, so avoid running it there directly</span>
<span class="c1"># It does run on a sampled subset</span>
<span class="n">user_activate_degree_dict</span> <span class="o">=</span> <span class="n">get_user_activate_degree_dict</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>
<span class="n">u2u_sim</span> <span class="o">=</span> <span class="n">usercf_sim</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">,</span> <span class="n">user_activate_degree_dict</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="item-embedding-sim">
<h3><span class="section-number">6.4.3.3. </span>item embedding sim<a class="headerlink" href="#item-embedding-sim" title="Permalink to this heading">¶</a></h3>
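<p>Item similarity is computed from embeddings so that, in the later cold-start step, articles that never appear in the click data can still be retrieved. Cold start is covered separately later; here we briefly introduce faiss.</p>
<p>Faiss is a library for clustering and similarity search open-sourced by Facebook's AI team, with a core implemented in C++. Thanks to its excellent performance, it is widely used in recommendation-related applications.</p>
<p>faiss is typically used in the vector recall stage of a recommender system. Vector recall is done as u2u, u2i, or i2i, where u and i stand for user and item. In real scenarios the numbers of users and items are huge. The most obvious vector-similarity recall is a two-level loop over the user or item list that computes the similarity of every pair of vectors, but this is impractical at scale. faiss accelerates finding the topk indexed vectors most similar to a query vector.</p>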
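<p>For reference, the two-level loop described above can be sketched in plain Python. This is not how faiss works internally; it is the naive baseline that faiss is meant to replace, and the names <code>brute_force_topk</code> and <code>toy</code> are hypothetical.</p>

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def brute_force_topk(vectors, k):
    """For every vector, return its k most similar other vectors as (index, cosine) pairs."""
    unit = [normalize(v) for v in vectors]
    result = []
    for i, q in enumerate(unit):            # outer loop: every query vector
        scores = [
            (sum(a * b for a, b in zip(q, v)), j)
            for j, v in enumerate(unit)     # inner loop: every candidate vector
            if j != i                       # skip the query itself
        ]
        result.append([(j, s) for s, j in heapq.nlargest(k, scores)])
    return result


toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
neighbours = brute_force_topk(toy, k=1)
print(neighbours)
```

<p>The cost grows with the square of the number of vectors, which is why this approach stops being practical for hundreds of thousands of articles.</p>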
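<p><strong>How faiss search works:</strong></p>
<p>faiss uses two techniques, PCA and PQ (product quantization), to compress and encode vectors. It applies other optimizations as well, but PCA and PQ are the core ones. Note that the IndexFlatIP index used in the code further down performs an exact inner-product search over the uncompressed vectors.</p>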
<ol class="arabic">
<li><div class="line-block">
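<div class="line">For the details of PCA dimensionality reduction, see the following link:</div>
<div class="line"><a class="reference external" href="https://www.cnblogs.com/pinard/p/6239403.html">Principal component analysis (PCA): a summary of the principles</a></div>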
</div>
</li>
<li><div class="line-block">
<div class="line">For the details of PQ encoding, see the following link:</div>
<div class="line"><a class="reference external" href="http://www.fabwrite.com/productquantization">Understanding the product quantization algorithm by example</a></div>
</div>
</li>
</ol>
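<p>The idea behind PQ can be shown with a toy encoder. The sketch below is an assumption-laden illustration rather than the faiss implementation: the codebooks are written by hand (a real system would learn them, typically with k-means), and the helper names <code>split</code>, <code>pq_encode</code> and <code>pq_decode</code> are hypothetical.</p>

```python
def split(vec, m):
    """Cut a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [
        min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
        for sub, book in zip(split(vec, len(codebooks)), codebooks)
    ]


def pq_decode(code, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    out = []
    for idx, book in zip(code, codebooks):
        out.extend(book[idx])
    return out


# Hypothetical hand-written codebooks: 2 sub-spaces with 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
code = pq_encode([0.9, 1.1, 0.2, 0.8], codebooks)
approx = pq_decode(code, codebooks)
print(code, approx)
```

<p>A four-dimensional float vector is stored as two small integers, at the price of an approximation error; this trade-off is what makes PQ attractive for very large collections.</p>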
<p><strong>Using faiss</strong></p>
<p><a class="reference external" href="https://github.com/facebookresearch/faiss/wiki/Getting-started">Official faiss tutorial</a></p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Similarity computation via vector retrieval</span>
<span class="c1"># topk: for each item, the faiss search returns its topk most similar items</span>
<span class="k">def</span><span class="w"> </span><span class="nf">embdding_sim</span><span class="p">(</span><span class="n">click_df</span><span class="p">,</span> <span class="n">item_emb_df</span><span class="p">,</span> <span class="n">save_path</span><span class="p">,</span> <span class="n">topk</span><span class="p">):</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
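<span class="sd">        Compute the content-based article embedding similarity matrix.</span>
<span class="sd">        :param click_df: click log DataFrame</span>
<span class="sd">        :param item_emb_df: article embeddings</span>
<span class="sd">        :param save_path: directory in which to save the result</span>
<span class="sd">        :param topk: number of most similar articles to retrieve</span>
<span class="sd">        :return: article similarity matrix (nested dict)</span>

<span class="sd">        Approach: for every article, return the topk most similar articles by embedding similarity; faiss is used to speed this up because there are so many articles</span>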
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="c1"># Map each row index to the raw article id</span>
    <span class="n">item_idx_2_rawid_dict</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">item_emb_df</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">item_emb_df</span><span class="p">[</span><span class="s1">&#39;article_id&#39;</span><span class="p">]))</span>

    <span class="n">item_emb_cols</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">item_emb_df</span><span class="o">.</span><span class="n">columns</span> <span class="k">if</span> <span class="s1">&#39;emb&#39;</span> <span class="ow">in</span> <span class="n">x</span><span class="p">]</span>
    <span class="n">item_emb_np</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ascontiguousarray</span><span class="p">(</span><span class="n">item_emb_df</span><span class="p">[</span><span class="n">item_emb_cols</span><span class="p">]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
    <span class="c1"># Normalize each vector to unit length</span>
    <span class="n">item_emb_np</span> <span class="o">=</span> <span class="n">item_emb_np</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">item_emb_np</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

    <span class="c1"># Build the faiss index</span>
    <span class="n">item_index</span> <span class="o">=</span> <span class="n">faiss</span><span class="o">.</span><span class="n">IndexFlatIP</span><span class="p">(</span><span class="n">item_emb_np</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">item_index</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">item_emb_np</span><span class="p">)</span>
    <span class="c1"># Similarity search: for the vector at each index position, return the topk items and their similarities</span>
    <span class="n">sim</span><span class="p">,</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">item_index</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">item_emb_np</span><span class="p">,</span> <span class="n">topk</span><span class="p">)</span> <span class="c1"># two NumPy arrays of shape (n_items, topk)</span>

    <span class="c1"># Convert the retrieval result back to raw article ids</span>
    <span class="n">item_sim_dict</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">dict</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">target_idx</span><span class="p">,</span> <span class="n">sim_value_list</span><span class="p">,</span> <span class="n">rele_idx_list</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">item_emb_np</span><span class="p">)),</span> <span class="n">sim</span><span class="p">,</span> <span class="n">idx</span><span class="p">)):</span>
        <span class="n">target_raw_id</span> <span class="o">=</span> <span class="n">item_idx_2_rawid_dict</span><span class="p">[</span><span class="n">target_idx</span><span class="p">]</span>
        <span class="c1"># Start from 1 to skip the item itself, so only topk-1 similar items are kept</span>
        <span class="k">for</span> <span class="n">rele_idx</span><span class="p">,</span> <span class="n">sim_value</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">rele_idx_list</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">sim_value_list</span><span class="p">[</span><span class="mi">1</span><span class="p">:]):</span>
            <span class="n">rele_raw_id</span> <span class="o">=</span> <span class="n">item_idx_2_rawid_dict</span><span class="p">[</span><span class="n">rele_idx</span><span class="p">]</span>
            <span class="n">item_sim_dict</span><span class="p">[</span><span class="n">target_raw_id</span><span class="p">][</span><span class="n">rele_raw_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">item_sim_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">target_raw_id</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">rele_raw_id</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">sim_value</span>

    <span class="c1"># Save the i2i similarity matrix</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">item_sim_dict</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;emb_i2i_sim.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>

    <span class="k">return</span> <span class="n">item_sim_dict</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># TODO: needs revisiting; usercf_sim is too memory-hungry, so sample for now</span>
<span class="n">item_emb_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_path</span> <span class="o">/</span> <span class="s1">&#39;articles_emb.csv&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">10000</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">emb_i2i_sim</span> <span class="o">=</span> <span class="n">embdding_sim</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">,</span> <span class="n">item_emb_df</span><span class="p">,</span> <span class="n">save_path</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="c1"># topk can be set as needed</span>
</pre></div>
</div>
</section>
</section>
<section id="id12">
<h2><span class="section-number">6.4.4. </span>Recall<a class="headerlink" href="#id12" title="Permalink to this heading">¶</a></h2>
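<p>This is the problem raised at the start of the chapter: faced with 360,000 articles and more than 200,000 users, what strategies can shrink the problem? In the recall stage we can select, for each user, a candidate set of articles they might click, which reduces the size of the problem. Common recall strategies:</p>
<ul>
<li><p>YouTube DNN recall</p></li>
<li><p>Article-based recall</p>
<ul>
<li><p>article collaborative filtering</p></li>
<li><p>recall based on article embeddings</p></li>
</ul>
</li>
<li><p>User-based recall</p>
<ul>
<li><p>user collaborative filtering</p></li>
<li><p>user embeddings</p></li>
</ul>
</li>
</ul>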
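<p>Some of the recall methods above start from the articles a user has already read and recall articles similar to them. Different ways of computing that similarity give different recall methods, such as article collaborative filtering and article content embeddings.
Others recommend by user similarity: a user is recommended articles read by other, similar users, as in user collaborative filtering and user embeddings.
A third line of thought resembles matrix factorization: once user and article embeddings have been computed, the similarity between a user and an article can be computed directly
and used for recommendation, as in YouTube DNN.
We now look at each recall method in detail:</p>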
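<p>As a rough illustration of the first family (recalling articles similar to those a user has already read), the sketch below scores candidates from a nested similarity dict of the same shape as <code>i2i_sim</code>. The function name <code>recall_by_item_sim</code>, its parameters and the toy data are hypothetical, and the scoring is deliberately simpler than a production recall function would be.</p>

```python
from collections import defaultdict


def recall_by_item_sim(user_hist, i2i_sim, sim_item_topk, recall_item_num):
    """Rank unseen articles by summed similarity to the user's clicked articles.

    user_hist: list of article ids the user has clicked
    i2i_sim: nested dict {item: {other_item: similarity}}
    sim_item_topk: how many nearest neighbours to take per clicked article
    recall_item_num: how many candidates to return
    """
    seen = set(user_hist)
    scores = defaultdict(float)
    for item in user_hist:
        # Take the most similar neighbours of each clicked article.
        neighbours = sorted(i2i_sim.get(item, {}).items(), key=lambda kv: kv[1], reverse=True)
        for cand, sim in neighbours[:sim_item_topk]:
            if cand not in seen:        # never recall what the user already read
                scores[cand] += sim
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:recall_item_num]


toy_sim = {
    "a": {"b": 0.9, "c": 0.4},
    "b": {"a": 0.9, "c": 0.5, "d": 0.2},
}
candidates = recall_by_item_sim(["a", "b"], toy_sim, sim_item_topk=2, recall_item_num=2)
print(candidates)
```

<p>User-based recall follows the same pattern with the roles swapped: take the most similar users from a user2user matrix and score the articles they clicked.</p>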
<section id="youtubednn">
<h3><span class="section-number">6.4.4.1. </span>YoutubeDNN recall<a class="headerlink" href="#youtubednn" title="Permalink to this heading">¶</a></h3>
<p><strong>(This step directly produces the list of candidate articles recalled for each user.)</strong></p>
<p><a class="reference external" href="https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf">Paper download link</a></p>
<p><strong>YoutubeDNN recall architecture</strong></p>
<figure class="align-default" id="id17">
<img alt="../_images/youtubednn_candidate.png" src="../_images/youtubednn_candidate.png" />
<figcaption>
<p><span class="caption-number">Fig. 6.4.2 </span><span class="caption-text">YoutubeDNN recall architecture</span><a class="headerlink" href="#id17" title="Permalink to this image">¶</a></p>
</figcaption>
</figure>
<p>For the principles and applications of YoutubeDNN, two blog posts by Wang Zhe are recommended:</p>
<ol class="arabic simple">
<li><p><a class="reference external" href="https://zhuanlan.zhihu.com/p/52169807">Rereading the YouTube deep learning recommender system paper: every word a gem</a></p></li>
<li><p><a class="reference external" href="https://zhuanlan.zhihu.com/p/52504407">Ten engineering questions about the YouTube deep learning recommender system</a></p></li>
</ol>
<p><strong>References:</strong></p>
<ol class="arabic simple">
<li><p><a class="reference external" href="https://zhuanlan.zhihu.com/p/52169807">https://zhuanlan.zhihu.com/p/52169807</a> (YouTubeDNN principles)</p></li>
<li><p><a class="reference external" href="https://zhuanlan.zhihu.com/p/26306795">https://zhuanlan.zhihu.com/p/26306795</a> (a widely upvoted Zhihu article on Word2Vec); word2vec itself is covered in the w2v introduction of the ranking part</p></li>
</ol>
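<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Build the training and validation data for two-tower recall</span>
<span class="c1"># negsample: number of negative samples per positive sample when building samples with a sliding window</span>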
<span class="k">def</span><span class="w"> </span><span class="nf">gen_data_set</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">negsample</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
    <span class="n">data</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s2">&quot;click_timestamp&quot;</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">item_ids</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>

    <span class="n">train_set</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">test_set</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">reviewerID</span><span class="p">,</span> <span class="n">hist</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
        <span class="n">pos_list</span> <span class="o">=</span> <span class="n">hist</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>

        <span class="k">if</span> <span class="n">negsample</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="n">candidate_set</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">item_ids</span><span class="p">)</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">pos_list</span><span class="p">))</span>   <span class="c1"># draw negatives from articles the user has not read</span>
            <span class="n">neg_list</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">candidate_set</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">pos_list</span><span class="p">)</span><span class="o">*</span><span class="n">negsample</span><span class="p">,</span><span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>  <span class="c1"># negsample negatives for every positive sample</span>

        <span class="c1"># With a single click, the record must still go into the training set; otherwise the learned embeddings would have gaps</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">pos_list</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">train_set</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">reviewerID</span><span class="p">,</span> <span class="p">[</span><span class="n">pos_list</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="n">pos_list</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="mi">1</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">pos_list</span><span class="p">)))</span>
            <span class="n">test_set</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">reviewerID</span><span class="p">,</span> <span class="p">[</span><span class="n">pos_list</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="n">pos_list</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="mi">1</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">pos_list</span><span class="p">)))</span>

        <span class="c1"># Build positive and negative samples with a sliding window</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pos_list</span><span class="p">)):</span>
            <span class="n">hist</span> <span class="o">=</span> <span class="n">pos_list</span><span class="p">[:</span><span class="n">i</span><span class="p">]</span>

            <span class="k">if</span> <span class="n">i</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">pos_list</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
                <span class="n">train_set</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">reviewerID</span><span class="p">,</span> <span class="n">hist</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">pos_list</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">hist</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">])))</span>  <span class="c1"># positive sample [user_id, his_item, pos_item, label, len(his_item)]</span>
                <span class="k">for</span> <span class="n">negi</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">negsample</span><span class="p">):</span>
                    <span class="n">train_set</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">reviewerID</span><span class="p">,</span> <span class="n">hist</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">neg_list</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="n">negsample</span><span class="o">+</span><span class="n">negi</span><span class="p">],</span> <span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">hist</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">])))</span> <span class="c1"># negative sample [user_id, his_item, neg_item, label, len(his_item)]</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="c1"># Hold out the longest history (with the last click as target) as the test sample</span>
                <span class="n">test_set</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">reviewerID</span><span class="p">,</span> <span class="n">hist</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">pos_list</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="mi">1</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">hist</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">])))</span>

    <span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">train_set</span><span class="p">)</span>
    <span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">test_set</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">train_set</span><span class="p">,</span> <span class="n">test_set</span>

<span class="c1"># Pad the input so that all sequence features have the same length</span>
<span class="k">def</span><span class="w"> </span><span class="nf">gen_model_input</span><span class="p">(</span><span class="n">train_set</span><span class="p">,</span><span class="n">user_profile</span><span class="p">,</span><span class="n">seq_max_len</span><span class="p">):</span>

    <span class="n">train_uid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">line</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">])</span>
    <span class="n">train_seq</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">]</span>
    <span class="n">train_iid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">line</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">])</span>
    <span class="n">train_label</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">line</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">])</span>
    <span class="n">train_hist_len</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">line</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">train_set</span><span class="p">])</span>

    <span class="n">train_seq_pad</span> <span class="o">=</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">train_seq</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">seq_max_len</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="s1">&#39;post&#39;</span><span class="p">,</span> <span class="n">truncating</span><span class="o">=</span><span class="s1">&#39;post&#39;</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">train_model_input</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&quot;user_id&quot;</span><span class="p">:</span> <span class="n">train_uid</span><span class="p">,</span> <span class="s2">&quot;click_article_id&quot;</span><span class="p">:</span> <span class="n">train_iid</span><span class="p">,</span> <span class="s2">&quot;hist_article_id&quot;</span><span class="p">:</span> <span class="n">train_seq_pad</span><span class="p">,</span>
                         <span class="s2">&quot;hist_len&quot;</span><span class="p">:</span> <span class="n">train_hist_len</span><span class="p">}</span>

    <span class="k">return</span> <span class="n">train_model_input</span><span class="p">,</span> <span class="n">train_label</span>
</pre></div>
</div>
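<p>The sliding-window split and the post-padding above can be checked on a toy click sequence. The following is only a minimal sketch under stated assumptions: <code>sliding_window_samples</code> and <code>pad_post</code> are hypothetical helpers written for illustration (they are not part of FunRec or Keras), negative sampling is left out, and <code>pad_post</code> only imitates the effect of <code>pad_sequences</code> with <code>padding='post'</code> and <code>truncating='post'</code>.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import numpy as np


def sliding_window_samples(user_id, clicks):
    # Hypothetical helper: same windowing idea as gen_data_set, without negatives.
    train, test = [], []
    for i in range(1, len(clicks)):
        hist = clicks[:i][::-1]  # reversed history: most recent click first
        sample = (user_id, hist, clicks[i], 1, len(hist))
        if i != len(clicks) - 1:
            train.append(sample)
        else:
            test.append(sample)  # the longest history is held out for testing
    return train, test


def pad_post(seqs, max_len, value=0):
    # Hypothetical helper: pad and truncate at the end of every sequence.
    out = np.full((len(seqs), max_len), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        kept = seq[:max_len]
        out[row, :len(kept)] = kept
    return out


train, test = sliding_window_samples(7, [11, 12, 13, 14])
padded = pad_post([s[1] for s in train], max_len=3)

print(train)            # [(7, [11], 12, 1, 1), (7, [12, 11], 13, 1, 2)]
print(test)             # [(7, [13, 12, 11], 14, 1, 3)]
print(padded.tolist())  # [[11, 0, 0], [12, 11, 0]]
</pre></div>
</div>
<p>Because each history is reversed before padding, truncating at the end drops the oldest clicks and keeps the most recent ones.</p>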
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># YouTubeDNN recall with FunRec</span>
<span class="k">def</span><span class="w"> </span><span class="nf">youtubednn_u2i_dict</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">20</span><span class="p">):</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
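<span class="sd">    Recall with FunRec&#39;s YouTubeDNN two-tower model, keeping the preprocessing consistent with the existing logic:</span>
<span class="sd">    - Labels/targets use positive sampling (sampled softmax uses item_id as the label internally)</span>
<span class="sd">    - Train/test samples are built with a sliding window; the most recent sequence is used for testing</span>
<span class="sd">    - The history sequence is fixed to length SEQ_LEN with post-padding</span>
<span class="sd">    - After training, user/item embeddings are extracted and FAISS runs top-k nearest-neighbor recall by inner product</span>
<span class="sd">    - Returns a recall dict of the form {user_raw_id: [(item_raw_id, score), ...]}</span>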
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="kn">import</span><span class="w"> </span><span class="nn">sys</span>
    <span class="kn">import</span><span class="w"> </span><span class="nn">numpy</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">np</span>
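    <span class="kn">import</span><span class="w"> </span><span class="nn">pickle</span>
    <span class="kn">import</span><span class="w"> </span><span class="nn">logging</span>
    <span class="kn">import</span><span class="w"> </span><span class="nn">faiss</span>
    <span class="kn">from</span><span class="w"> </span><span class="nn">collections</span><span class="w"> </span><span class="kn">import</span> <span class="n">defaultdict</span>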
    <span class="kn">from</span><span class="w"> </span><span class="nn">tqdm</span><span class="w"> </span><span class="kn">import</span> <span class="n">tqdm</span>
    <span class="kn">from</span><span class="w"> </span><span class="nn">sklearn.preprocessing</span><span class="w"> </span><span class="kn">import</span> <span class="n">LabelEncoder</span>

    <span class="kn">from</span><span class="w"> </span><span class="nn">funrec.features.feature_column</span><span class="w"> </span><span class="kn">import</span> <span class="n">FeatureColumn</span>
    <span class="kn">from</span><span class="w"> </span><span class="nn">funrec.training.trainer</span><span class="w"> </span><span class="kn">import</span> <span class="n">train_model</span>

    <span class="c1"># Inline config (based on config_youtubednn.yaml, adapted to the column names of this dataset)</span>
    <span class="n">SEQ_LEN</span> <span class="o">=</span> <span class="mi">30</span>
    <span class="n">emb_dim</span> <span class="o">=</span> <span class="mi">16</span>
    <span class="n">neg_sample</span> <span class="o">=</span> <span class="mi">20</span>
    <span class="n">dnn_units</span> <span class="o">=</span> <span class="p">[</span><span class="mi">32</span><span class="p">]</span>
    <span class="n">label_name</span> <span class="o">=</span> <span class="s1">&#39;click_article_id&#39;</span>

    <span class="c1"># Copy the data and label-encode the id columns (consistent with the existing logic)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="n">user_profile_raw</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s2">&quot;user_id&quot;</span><span class="p">]]</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)</span>
    <span class="n">item_profile_raw</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s2">&quot;click_article_id&quot;</span><span class="p">]]</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">)</span>

    <span class="n">encoders</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="n">feature_max_idx</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;user_id&quot;</span><span class="p">,</span> <span class="s2">&quot;click_article_id&quot;</span><span class="p">]:</span>
        <span class="n">lbe</span> <span class="o">=</span> <span class="n">LabelEncoder</span><span class="p">()</span>
        <span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">lbe</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">])</span>
        <span class="n">encoders</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">lbe</span>
        <span class="n">feature_max_idx</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">())</span> <span class="o">+</span> <span class="mi">1</span>

    <span class="c1"># Profiles (only used to map encoded ids back to raw ids)</span>
    <span class="n">user_profile</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s2">&quot;user_id&quot;</span><span class="p">]]</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="s1">&#39;user_id&#39;</span><span class="p">)</span>
    <span class="n">item_profile</span> <span class="o">=</span> <span class="n">df</span><span class="p">[[</span><span class="s2">&quot;click_article_id&quot;</span><span class="p">]]</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">)</span>
    <span class="n">user_index_2_rawid</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">user_profile</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">user_profile_raw</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">]))</span>
    <span class="n">item_index_2_rawid</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">item_profile</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">item_profile_raw</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]))</span>

    <span class="c1"># Build train/test samples with the existing logic</span>
    <span class="n">train_set</span><span class="p">,</span> <span class="n">test_set</span> <span class="o">=</span> <span class="n">gen_data_set</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">train_model_input</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">gen_model_input</span><span class="p">(</span><span class="n">train_set</span><span class="p">,</span> <span class="n">user_profile</span><span class="p">,</span> <span class="n">SEQ_LEN</span><span class="p">)</span>
    <span class="n">test_model_input</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">gen_model_input</span><span class="p">(</span><span class="n">test_set</span><span class="p">,</span> <span class="n">user_profile</span><span class="p">,</span> <span class="n">SEQ_LEN</span><span class="p">)</span>

    <span class="c1"># Keep only the input keys the model actually needs</span>
    <span class="n">input_keys</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span> <span class="s1">&#39;click_article_id&#39;</span><span class="p">,</span> <span class="s1">&#39;hist_article_id&#39;</span><span class="p">]</span>
    <span class="n">train_X</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">train_model_input</span><span class="p">[</span><span class="n">k</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">input_keys</span><span class="p">}</span>
    <span class="n">test_X</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">test_model_input</span><span class="p">[</span><span class="n">k</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">input_keys</span><span class="p">}</span>

    <span class="c1"># Define the feature columns by hand (no external data dictionary needed)</span>
    <span class="n">feature_columns</span> <span class="o">=</span> <span class="p">[</span>
        <span class="n">FeatureColumn</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span> <span class="n">group</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;user_dnn&#39;</span><span class="p">],</span> <span class="nb">type</span><span class="o">=</span><span class="s1">&#39;sparse&#39;</span><span class="p">,</span> <span class="n">vocab_size</span><span class="o">=</span><span class="n">feature_max_idx</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">emb_dim</span><span class="o">=</span><span class="n">emb_dim</span><span class="p">),</span>
        <span class="n">FeatureColumn</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">,</span> <span class="n">group</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;target_item&#39;</span><span class="p">],</span> <span class="nb">type</span><span class="o">=</span><span class="s1">&#39;sparse&#39;</span><span class="p">,</span> <span class="n">vocab_size</span><span class="o">=</span><span class="n">feature_max_idx</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">emb_dim</span><span class="o">=</span><span class="n">emb_dim</span><span class="p">),</span>
        <span class="n">FeatureColumn</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">&#39;hist_article_id&#39;</span><span class="p">,</span> <span class="n">emb_name</span><span class="o">=</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">,</span> <span class="n">group</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;raw_hist_seq&#39;</span><span class="p">],</span> <span class="nb">type</span><span class="o">=</span><span class="s1">&#39;varlen_sparse&#39;</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="n">SEQ_LEN</span><span class="p">,</span> <span class="n">combiner</span><span class="o">=</span><span class="s1">&#39;mean&#39;</span><span class="p">,</span> <span class="n">emb_dim</span><span class="o">=</span><span class="n">emb_dim</span><span class="p">,</span> <span class="n">vocab_size</span><span class="o">=</span><span class="n">feature_max_idx</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]),</span>
    <span class="p">]</span>

    <span class="c1"># Assemble processed_data (in the structure the FunRec trainer expects)</span>
    <span class="n">processed_data</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s1">&#39;train&#39;</span><span class="p">:</span> <span class="p">{</span>
            <span class="s1">&#39;features&#39;</span><span class="p">:</span> <span class="n">train_X</span><span class="p">,</span>
            <span class="s1">&#39;labels&#39;</span><span class="p">:</span> <span class="kc">None</span>  <span class="c1"># replaced internally with all ones by the positive_sampling_labels rule</span>
        <span class="p">},</span>
        <span class="s1">&#39;test&#39;</span><span class="p">:</span> <span class="p">{</span>
            <span class="s1">&#39;features&#39;</span><span class="p">:</span> <span class="n">test_X</span><span class="p">,</span>
            <span class="s1">&#39;labels&#39;</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span>
            <span class="s1">&#39;eval_data&#39;</span><span class="p">:</span> <span class="p">{}</span>
        <span class="p">},</span>
        <span class="s1">&#39;all_items&#39;</span><span class="p">:</span> <span class="p">{</span>
            <span class="s1">&#39;click_article_id&#39;</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">feature_max_idx</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
        <span class="p">},</span>
        <span class="s1">&#39;feature_dict&#39;</span><span class="p">:</span> <span class="p">{</span>
            <span class="s1">&#39;user_id&#39;</span><span class="p">:</span> <span class="n">feature_max_idx</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span>
            <span class="s1">&#39;click_article_id&#39;</span><span class="p">:</span> <span class="n">feature_max_idx</span><span class="p">[</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">]</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1"># Training config (inline)</span>
    <span class="n">training_config</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s1">&#39;build_function&#39;</span><span class="p">:</span> <span class="s1">&#39;funrec.models.youtubednn.build_youtubednn_model&#39;</span><span class="p">,</span>
        <span class="s1">&#39;data_preprocessing&#39;</span><span class="p">:</span> <span class="p">[</span>
            <span class="p">{</span><span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;positive_sampling_labels&#39;</span><span class="p">}</span>
        <span class="p">],</span>
        <span class="s1">&#39;model_params&#39;</span><span class="p">:</span> <span class="p">{</span>
            <span class="s1">&#39;emb_dim&#39;</span><span class="p">:</span> <span class="n">emb_dim</span><span class="p">,</span>
            <span class="s1">&#39;neg_sample&#39;</span><span class="p">:</span> <span class="n">neg_sample</span><span class="p">,</span>
            <span class="s1">&#39;dnn_units&#39;</span><span class="p">:</span> <span class="n">dnn_units</span><span class="p">,</span>
            <span class="s1">&#39;label_name&#39;</span><span class="p">:</span> <span class="n">label_name</span>
        <span class="p">},</span>
        <span class="s1">&#39;optimizer&#39;</span><span class="p">:</span> <span class="s1">&#39;adam&#39;</span><span class="p">,</span>
        <span class="s1">&#39;optimizer_params&#39;</span><span class="p">:</span> <span class="p">{</span>
            <span class="s1">&#39;learning_rate&#39;</span><span class="p">:</span> <span class="mf">1e-4</span>
        <span class="p">},</span>
        <span class="s1">&#39;loss&#39;</span><span class="p">:</span> <span class="s1">&#39;sampledsoftmaxloss&#39;</span><span class="p">,</span>
        <span class="s1">&#39;batch_size&#39;</span><span class="p">:</span> <span class="mi">128</span><span class="p">,</span>
        <span class="s1">&#39;epochs&#39;</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>
        <span class="s1">&#39;verbose&#39;</span><span class="p">:</span> <span class="mi">0</span>
    <span class="p">}</span>

    <span class="c1"># Train the model (returns main_model, user_model, item_model)</span>
    <span class="n">model</span><span class="p">,</span> <span class="n">user_model</span><span class="p">,</span> <span class="n">item_model</span> <span class="o">=</span> <span class="n">train_model</span><span class="p">(</span><span class="n">training_config</span><span class="p">,</span> <span class="n">feature_columns</span><span class="p">,</span> <span class="n">processed_data</span><span class="p">)</span>

    <span class="c1"># Extract embeddings</span>
    <span class="n">user_inputs_for_pred</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">test_X</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">,</span> <span class="s1">&#39;hist_article_id&#39;</span><span class="p">]}</span>
    <span class="n">user_embs</span> <span class="o">=</span> <span class="n">user_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">user_inputs_for_pred</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">2</span> <span class="o">**</span> <span class="mi">12</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">item_embs</span> <span class="o">=</span> <span class="n">item_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">processed_data</span><span class="p">[</span><span class="s1">&#39;all_items&#39;</span><span class="p">],</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">2</span> <span class="o">**</span> <span class="mi">12</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

    <span class="c1"># L2-normalize each row (consistent with the existing logic)</span>
    <span class="n">user_embs</span> <span class="o">=</span> <span class="n">user_embs</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">user_embs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">item_embs</span> <span class="o">=</span> <span class="n">item_embs</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">item_embs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>

    <span class="c1"># Save the embeddings (consistent with the existing logic; note the mapping back to raw ids)</span>
    <span class="n">raw_user_id_emb_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">user_index_2_rawid</span><span class="p">[</span><span class="n">k</span><span class="p">]:</span> <span class="n">v</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">test_X</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">user_embs</span><span class="p">)}</span>
    <span class="n">raw_item_id_emb_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">item_index_2_rawid</span><span class="p">[</span><span class="n">k</span><span class="p">]:</span> <span class="n">v</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">processed_data</span><span class="p">[</span><span class="s1">&#39;all_items&#39;</span><span class="p">][</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">],</span> <span class="n">item_embs</span><span class="p">)}</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">raw_user_id_emb_dict</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;user_youtube_emb.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">raw_item_id_emb_dict</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;item_youtube_emb.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>

    <span class="c1"># Recall by vector search with FAISS</span>
    <span class="n">index</span> <span class="o">=</span> <span class="n">faiss</span><span class="o">.</span><span class="n">IndexFlatIP</span><span class="p">(</span><span class="n">emb_dim</span><span class="p">)</span>
    <span class="n">index</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">item_embs</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">))</span>
    <span class="n">sim</span><span class="p">,</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">index</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ascontiguousarray</span><span class="p">(</span><span class="n">user_embs</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)),</span> <span class="n">topk</span><span class="p">)</span>

    <span class="n">user_recall_items_dict</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">dict</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">target_idx</span><span class="p">,</span> <span class="n">sim_value_list</span><span class="p">,</span> <span class="n">rele_idx_list</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">test_X</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">],</span> <span class="n">sim</span><span class="p">,</span> <span class="n">idx</span><span class="p">),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
        <span class="n">target_raw_id</span> <span class="o">=</span> <span class="n">user_index_2_rawid</span><span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">target_idx</span><span class="p">)]</span>
        <span class="c1"># Start from index 1 to drop the single most similar hit (usually the query itself or a very close neighbour)</span>
        <span class="k">for</span> <span class="n">rele_idx</span><span class="p">,</span> <span class="n">sim_value</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">rele_idx_list</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">sim_value_list</span><span class="p">[</span><span class="mi">1</span><span class="p">:]):</span>
            <span class="n">rele_raw_id</span> <span class="o">=</span> <span class="n">item_index_2_rawid</span><span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">rele_idx</span><span class="p">)]</span>
            <span class="n">user_recall_items_dict</span><span class="p">[</span><span class="n">target_raw_id</span><span class="p">][</span><span class="n">rele_raw_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_recall_items_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">target_raw_id</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">rele_raw_id</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="nb">float</span><span class="p">(</span><span class="n">sim_value</span><span class="p">)</span>

    <span class="c1"># Sort by score and save</span>
    <span class="n">user_recall_items_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">v</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">user_recall_items_dict</span><span class="o">.</span><span class="n">items</span><span class="p">()}</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;youtube_u2i_dict.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">user_recall_items_dict</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Recall is evaluated here, so the last click of every user is held out from the training set</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;youtubednn_recall&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">youtubednn_u2i_dict</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">trn_last_click_df</span> <span class="o">=</span> <span class="n">get_hist_and_last_click</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>
    <span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;youtubednn_recall&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">youtubednn_u2i_dict</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
    <span class="c1"># Evaluate recall quality</span>
    <span class="n">metrics_recall</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;youtubednn_recall&#39;</span><span class="p">],</span> <span class="n">trn_last_click_df</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</pre></div>
</div>
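<p>The evaluation above holds out each user's last click and checks whether the recall list contains it. As a minimal sketch of that idea (the helper name <code>hit_rate_at_k</code> and the toy data are hypothetical and not taken from this project, and the actual <code>metrics_recall</code> function may compute its numbers differently), a hit rate at k can be written as:</p>

```python
def hit_rate_at_k(user_recall_items, last_click, k):
    """Fraction of users whose held-out last click appears in their top-k recall list.

    user_recall_items: {user: [(item, score), ...]} sorted by score, descending
    last_click:        {user: item} held-out last click of each user
    """
    users = [u for u in last_click if u in user_recall_items]
    if not users:
        return 0.0
    hits = 0
    for u in users:
        topk_items = [item for item, _ in user_recall_items[u][:k]]
        if last_click[u] in topk_items:
            hits += 1
    return hits / len(users)


# Toy data (hypothetical): u1 is hit at rank 2, u2 is never hit.
toy_recall = {
    "u1": [("a", 0.9), ("b", 0.5), ("c", 0.1)],
    "u2": [("d", 0.8), ("e", 0.4), ("f", 0.2)],
}
toy_last_click = {"u1": "b", "u2": "x"}

print(hit_rate_at_k(toy_recall, toy_last_click, k=1))  # 0.0
print(hit_rate_at_k(toy_recall, toy_last_click, k=2))  # 0.5
```

<p>Holding out the last click keeps the evaluation target out of the history used to build the recall list, which is why the history/last-click split is only performed when recall is being measured.</p>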
</section>
<section id="itemcf-recall">
<h3><span class="section-number">6.4.4.2. </span>itemCF recall<a class="headerlink" href="#itemcf-recall" title="Permalink to this heading">¶</a></h3>
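<p>The article similarity matrices have already been computed above, one by collaborative filtering and one by Embedding retrieval. Following the collaborative-filtering idea, we now recall for each user the articles most similar to those in their click history.
Several association-rule style weights are applied during recall:</p>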
<ol class="arabic simple">
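<li><p>A weight for the position of the history article in the user's click sequence, so that more recent clicks count more (see the code for details)</p></li>
<li><p>A weight for article creation time, that is, for the difference between the creation times of the similar article and the clicked history article</p></li>
<li><p>A weight for content similarity, computed from article Embeddings. Note that the Embedding step did not compute the similarity between every pair of articles, so a similar article may have no Embedding similarity with the clicked history article, and this case needs special handling</p></li>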
</ol>
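<p>As a minimal sketch of how these three weights combine with the collaborative-filtering similarity for a single (history article, similar article) pair, consider the helper below. The function name <code>itemcf_pair_score</code> and its argument names are hypothetical; the constants 0.9 and 0.8 follow the code in this section, and creation times are assumed to be on a small, normalized scale.</p>

```python
import math


def itemcf_pair_score(wij, loc, hist_len, created_i, created_j,
                      emb_sim_ij=None, emb_sim_ji=None):
    """Score contribution of similar article j given history article i (illustrative only)."""
    # Position weight: the most recent click (largest loc) is discounted the least.
    loc_weight = 0.9 ** (hist_len - loc)
    # Creation-time weight: largest (exp(1)) when both articles were created at the same time.
    created_time_weight = math.exp(0.8 ** abs(created_i - created_j))
    # Content weight: falls back to 1.0 when no embedding similarity is stored for the pair.
    content_weight = 1.0
    if emb_sim_ij is not None:
        content_weight += emb_sim_ij
    if emb_sim_ji is not None:
        content_weight += emb_sim_ji
    return created_time_weight * loc_weight * content_weight * wij


# The same pair scores higher when the history article was clicked more recently.
old_click = itemcf_pair_score(wij=1.0, loc=0, hist_len=3, created_i=0.5, created_j=0.5)
recent_click = itemcf_pair_score(wij=1.0, loc=2, hist_len=3, created_i=0.5, created_j=0.5)
print(old_click < recent_click)  # True
```

<p>Starting the content weight at 1.0 is what handles the missing-similarity case: a pair without a stored Embedding similarity is left unscaled rather than zeroed out.</p>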
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Item-based recall (i2i)</span>
<span class="k">def</span><span class="w"> </span><span class="nf">item_based_recommend</span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">user_item_time_dict</span><span class="p">,</span> <span class="n">i2i_sim</span><span class="p">,</span> <span class="n">sim_item_topk</span><span class="p">,</span> <span class="n">recall_item_num</span><span class="p">,</span> <span class="n">item_topk_click</span><span class="p">,</span> <span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">emb_i2i_sim</span><span class="p">):</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
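<span class="sd">        Recall based on item (article) collaborative filtering</span>
<span class="sd">        :param user_id: user id</span>
<span class="sd">        :param user_item_time_dict: dict, click sequence of each user ordered by click time   {user1: [(item1, time1), (item2, time2), ...], ...}</span>
<span class="sd">        :param i2i_sim: dict, article similarity matrix</span>
<span class="sd">        :param sim_item_topk: int, number of most similar articles to take for each history article</span>
<span class="sd">        :param recall_item_num: int, number of articles finally recalled</span>
<span class="sd">        :param item_topk_click: list, most-clicked articles, used to back-fill the recall list</span>
<span class="sd">        :param item_created_time_dict: dict, creation time of each article</span>
<span class="sd">        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings</span>

<span class="sd">        return: list of recalled articles, [(item1, score1), (item2, score2), ...]</span>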

<span class="sd">    &quot;&quot;&quot;</span>
    <span class="c1"># Click history of the user: a list of (item, click_time) pairs</span>
    <span class="n">user_hist_items</span> <span class="o">=</span> <span class="n">user_item_time_dict</span><span class="p">[</span><span class="n">user_id</span><span class="p">]</span>
    <span class="c1"># Item ids only, so that already-clicked articles can be filtered out by id</span>
    <span class="n">user_hist_item_ids</span> <span class="o">=</span> <span class="p">{</span><span class="n">item</span> <span class="k">for</span> <span class="n">item</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">user_hist_items</span><span class="p">}</span>

    <span class="n">item_rank</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">loc</span><span class="p">,</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">click_time</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">user_hist_items</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">wij</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">i2i_sim</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">sim_item_topk</span><span class="p">]:</span>
            <span class="k">if</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">user_hist_item_ids</span><span class="p">:</span>
                <span class="k">continue</span>

            <span class="c1"># Weight from the creation-time difference between the two articles</span>
            <span class="n">created_time_weight</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="mf">0.8</span> <span class="o">**</span> <span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">item_created_time_dict</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">item_created_time_dict</span><span class="p">[</span><span class="n">j</span><span class="p">]))</span>
            <span class="c1"># Weight from the position of the history article in the click sequence (recent clicks count more)</span>
            <span class="n">loc_weight</span> <span class="o">=</span> <span class="p">(</span><span class="mf">0.9</span> <span class="o">**</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">user_hist_items</span><span class="p">)</span> <span class="o">-</span> <span class="n">loc</span><span class="p">))</span>

            <span class="n">content_weight</span> <span class="o">=</span> <span class="mf">1.0</span>
            <span class="k">if</span> <span class="n">emb_i2i_sim</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
                <span class="n">content_weight</span> <span class="o">+=</span> <span class="n">emb_i2i_sim</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span>
            <span class="k">if</span> <span class="n">emb_i2i_sim</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
                <span class="n">content_weight</span> <span class="o">+=</span> <span class="n">emb_i2i_sim</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="n">i</span><span class="p">]</span>

            <span class="n">item_rank</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
            <span class="n">item_rank</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">created_time_weight</span> <span class="o">*</span> <span class="n">loc_weight</span> <span class="o">*</span> <span class="n">content_weight</span> <span class="o">*</span> <span class="n">wij</span>

    <span class="c1"># Fewer than recall_item_num candidates: back-fill with popular articles</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">item_rank</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">recall_item_num</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">item_topk_click</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">item_rank</span><span class="p">:</span> <span class="c1"># a back-filled item must not already be a candidate</span>
                <span class="k">continue</span>
            <span class="n">item_rank</span><span class="p">[</span><span class="n">item</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">100</span> <span class="c1"># any negative score will do; it keeps back-filled items at the end</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">item_rank</span><span class="p">)</span> <span class="o">==</span> <span class="n">recall_item_num</span><span class="p">:</span>
                <span class="k">break</span>

    <span class="n">item_rank</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">item_rank</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">recall_item_num</span><span class="p">]</span>

    <span class="k">return</span> <span class="n">item_rank</span>
</pre></div>
</div>
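<p>The popularity back-fill at the end of the function can also be read in isolation. The sketch below is an illustrative helper (the name <code>backfill_with_popular</code> and the toy inputs are hypothetical); it tests membership against the keys of the score dictionary and gives back-filled articles negative scores so they sort after the genuinely recalled ones.</p>

```python
def backfill_with_popular(item_rank, popular_items, recall_item_num):
    """Pad a {item: score} dict with popular items until it holds recall_item_num entries."""
    filled = dict(item_rank)  # copy so the caller's dict is left untouched
    for rank, item in enumerate(popular_items):
        if len(filled) >= recall_item_num:
            break
        if item in filled:  # membership test on keys, not on (key, value) pairs
            continue
        # Negative placeholder score: preserves popularity order and sorts after real scores.
        filled[item] = -rank - 100
    return sorted(filled.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]


result = backfill_with_popular({"a": 0.7, "b": 0.2}, ["b", "c", "d", "e"], 4)
print(result)  # [('a', 0.7), ('b', 0.2), ('c', -101), ('d', -102)]
```

<p>Note that <code>item in some_dict.items()</code> compares against <code>(key, value)</code> tuples, so a bare item id never matches; testing against the dictionary itself is what prevents an already recalled article from being overwritten with a placeholder score.</p>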
<section id="itemcf-sim">
<h4><span class="section-number">6.4.4.2.1. </span>itemCF sim recall<a class="headerlink" href="#itemcf-sim" title="Permalink to this heading">¶</a></h4>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Run itemCF recall first; the last click is held out so that recall can be evaluated</span>

<span class="k">if</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">trn_last_click_df</span> <span class="o">=</span> <span class="n">get_hist_and_last_click</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span> <span class="o">=</span> <span class="n">all_click_df</span>

<span class="n">user_recall_items_dict</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">dict</span><span class="p">)</span>
<span class="n">user_item_time_dict</span> <span class="o">=</span> <span class="n">get_user_item_time</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">)</span>

<span class="n">i2i_sim</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;itemcf_i2i_sim.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">))</span>
<span class="n">emb_i2i_sim</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;emb_i2i_sim.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">))</span>

<span class="n">sim_item_topk</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">recall_item_num</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">item_topk_click</span> <span class="o">=</span> <span class="n">get_item_topk_click</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>

<span class="k">for</span> <span class="n">user</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
    <span class="n">user_recall_items_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span> <span class="o">=</span> <span class="n">item_based_recommend</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">user_item_time_dict</span><span class="p">,</span> \
                                                        <span class="n">i2i_sim</span><span class="p">,</span> <span class="n">sim_item_topk</span><span class="p">,</span> <span class="n">recall_item_num</span><span class="p">,</span> \
                                                        <span class="n">item_topk_click</span><span class="p">,</span> <span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">emb_i2i_sim</span><span class="p">)</span>

<span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;itemcf_sim_itemcf_recall&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_recall_items_dict</span>
<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;itemcf_sim_itemcf_recall&#39;</span><span class="p">],</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;itemcf_recall_dict.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>

<span class="k">if</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="c1"># Evaluate recall quality</span>
    <span class="n">metrics_recall</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;itemcf_sim_itemcf_recall&#39;</span><span class="p">],</span> <span class="n">trn_last_click_df</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="n">recall_item_num</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="embedding-sim">
<h4><span class="section-number">6.4.4.2.2. </span>embedding sim recall<a class="headerlink" href="#embedding-sim" title="Permalink to this heading">¶</a></h4>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># The last click is held out here so that recall can be evaluated</span>
<span class="k">if</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">trn_last_click_df</span> <span class="o">=</span> <span class="n">get_hist_and_last_click</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span> <span class="o">=</span> <span class="n">all_click_df</span>

<span class="n">user_recall_items_dict</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">dict</span><span class="p">)</span>
<span class="n">user_item_time_dict</span> <span class="o">=</span> <span class="n">get_user_item_time</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">)</span>
<span class="n">i2i_sim</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;emb_i2i_sim.pkl&#39;</span><span class="p">,</span><span class="s1">&#39;rb&#39;</span><span class="p">))</span>

<span class="n">sim_item_topk</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">recall_item_num</span> <span class="o">=</span> <span class="mi">10</span>

<span class="n">item_topk_click</span> <span class="o">=</span> <span class="n">get_item_topk_click</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>

<span class="k">for</span> <span class="n">user</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
    <span class="n">user_recall_items_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span> <span class="o">=</span> <span class="n">item_based_recommend</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">user_item_time_dict</span><span class="p">,</span> <span class="n">i2i_sim</span><span class="p">,</span> <span class="n">sim_item_topk</span><span class="p">,</span>
                                                        <span class="n">recall_item_num</span><span class="p">,</span> <span class="n">item_topk_click</span><span class="p">,</span> <span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">emb_i2i_sim</span><span class="p">)</span>

<span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;embedding_sim_item_recall&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_recall_items_dict</span>
<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;embedding_sim_item_recall&#39;</span><span class="p">],</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;embedding_sim_item_recall.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>

<span class="k">if</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="c1"># Evaluate recall quality</span>
    <span class="n">metrics_recall</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;embedding_sim_item_recall&#39;</span><span class="p">],</span> <span class="n">trn_last_click_df</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="n">recall_item_num</span><span class="p">)</span>
</pre></div>
</div>
</section>
</section>
<section id="usercf">
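<h3><span class="section-number">6.4.4.3. </span>userCF recall<a class="headerlink" href="#usercf" title="Permalink to this heading">¶</a></h3>
<p>User-based collaborative filtering recommends to a user the articles clicked by users similar to them. Because the click histories of similar users are involved, association rules can again be used to weight the articles the user is likely to click. The rules used here weight the relationship between an article clicked by a similar user and the click history of the target user. This relationship borrows directly from the item-based collaborative filtering above, except that it is accumulated over every article in the target user's history. The weights used and the corresponding code are as follows:</p>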
<ol class="arabic simple">
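<li><p>For each article clicked by a similar user, compute its content similarity, creation-time difference and relative position against every article in the target user's click history, and sum each of these over the history to obtain the three weights</p></li>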
</ol>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># User-based recall (u2u2i)</span>
<span class="k">def</span><span class="w"> </span><span class="nf">user_based_recommend</span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">user_item_time_dict</span><span class="p">,</span> <span class="n">u2u_sim</span><span class="p">,</span> <span class="n">sim_user_topk</span><span class="p">,</span> <span class="n">recall_item_num</span><span class="p">,</span>
                         <span class="n">item_topk_click</span><span class="p">,</span> <span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">emb_i2i_sim</span><span class="p">):</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd">        Recall based on user collaborative filtering</span>
<span class="sd">        :param user_id: user id</span>
<span class="sd">        :param user_item_time_dict: dict, click sequence of each user ordered by click time   {user1: [(item1, time1), (item2, time2), ...], ...}</span>
<span class="sd">        :param u2u_sim: dict, user similarity matrix</span>
<span class="sd">        :param sim_user_topk: int, number of most similar users to take for the current user</span>
<span class="sd">        :param recall_item_num: int, number of articles finally recalled</span>
<span class="sd">        :param item_topk_click: list, most-clicked articles, used to back-fill the recall list</span>
<span class="sd">        :param item_created_time_dict: dict, creation time of each article</span>
<span class="sd">        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings</span>

<span class="sd">        return: list of recalled articles, [(item1, score1), (item2, score2), ...]</span>
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="c1"># Click history of the target user</span>
    <span class="n">user_item_time_list</span> <span class="o">=</span> <span class="n">user_item_time_dict</span><span class="p">[</span><span class="n">user_id</span><span class="p">]</span>    <span class="c1"># [(item1, time1), (item2, time2), ...]</span>
    <span class="n">user_hist_items</span> <span class="o">=</span> <span class="nb">set</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">user_item_time_list</span><span class="p">])</span>   <span class="c1"># a user may click the same article several times, so deduplicate</span>

    <span class="n">items_rank</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">sim_u</span><span class="p">,</span> <span class="n">wuv</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">u2u_sim</span><span class="p">[</span><span class="n">user_id</span><span class="p">]</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">sim_user_topk</span><span class="p">]:</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">click_time</span> <span class="ow">in</span> <span class="n">user_item_time_dict</span><span class="p">[</span><span class="n">sim_u</span><span class="p">]:</span>
            <span class="k">if</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">user_hist_items</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="n">items_rank</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

            <span class="n">loc_weight</span> <span class="o">=</span> <span class="mf">1.0</span>
            <span class="n">content_weight</span> <span class="o">=</span> <span class="mf">1.0</span>
            <span class="n">created_time_weight</span> <span class="o">=</span> <span class="mf">1.0</span>

            <span class="c1"># Accumulate the weights of the candidate article against every article in the target user history</span>
            <span class="k">for</span> <span class="n">loc</span><span class="p">,</span> <span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="n">click_time</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">user_item_time_list</span><span class="p">):</span>
                <span class="c1"># Relative position weight of the click</span>
                <span class="n">loc_weight</span> <span class="o">+=</span> <span class="mf">0.9</span> <span class="o">**</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">user_item_time_list</span><span class="p">)</span> <span class="o">-</span> <span class="n">loc</span><span class="p">)</span>
                <span class="c1"># Content similarity weight</span>
                <span class="k">if</span> <span class="n">emb_i2i_sim</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
                    <span class="n">content_weight</span> <span class="o">+=</span> <span class="n">emb_i2i_sim</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span>
                <span class="k">if</span> <span class="n">emb_i2i_sim</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
                    <span class="n">content_weight</span> <span class="o">+=</span> <span class="n">emb_i2i_sim</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="n">i</span><span class="p">]</span>

                <span class="c1"># Creation-time difference weight (same decaying form as in item_based_recommend)</span>
                <span class="n">created_time_weight</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="mf">0.8</span> <span class="o">**</span> <span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">item_created_time_dict</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">item_created_time_dict</span><span class="p">[</span><span class="n">j</span><span class="p">]))</span>

            <span class="n">items_rank</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">loc_weight</span> <span class="o">*</span> <span class="n">content_weight</span> <span class="o">*</span> <span class="n">created_time_weight</span> <span class="o">*</span> <span class="n">wuv</span>

    <span class="c1"># 热度补全</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">items_rank</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">recall_item_num</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">item_topk_click</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">items_rank</span><span class="p">:</span> <span class="c1"># a padded item must not already be among the candidates</span>
                <span class="k">continue</span>
            <span class="n">items_rank</span><span class="p">[</span><span class="n">item</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">100</span> <span class="c1"># any negative score will do</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">items_rank</span><span class="p">)</span> <span class="o">==</span> <span class="n">recall_item_num</span><span class="p">:</span>
                <span class="k">break</span>

    <span class="n">items_rank</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">items_rank</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">recall_item_num</span><span class="p">]</span>

    <span class="k">return</span> <span class="n">items_rank</span>
</pre></div>
</div>
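<p>The padding step at the end of the function can be looked at in isolation. The following is a minimal sketch, not the tutorial's own helper: the name <code>pad_with_popular</code> and the toy data are assumptions made for illustration. It tests membership against the dictionary keys and gives padded items negative scores, assuming that real candidate scores are positive.</p>

```python
def pad_with_popular(items_rank, item_topk_click, recall_item_num):
    """Pad a {item: score} dict with popular items up to recall_item_num entries.

    Hypothetical helper for illustration; it mirrors the padding logic above.
    """
    padded = dict(items_rank)
    for i, item in enumerate(item_topk_click):
        if len(padded) >= recall_item_num:
            break
        if item in padded:  # membership test on keys, not on (key, value) pairs
            continue
        padded[item] = -i - 100  # negative score keeps padded items below real candidates
    return sorted(padded.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]


# Toy data (assumed): two scored candidates, four popular items.
ranked = pad_with_popular({"a": 0.9, "b": 0.4}, ["b", "c", "d", "e"], 4)
print(ranked)
```

<p>Item <code>b</code> is skipped because it is already a candidate, so the two free slots go to <code>c</code> and <code>d</code>, which rank below the real candidates.</p>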
<section id="usercf-sim">
<h4><span class="section-number">6.4.4.3.1. </span>userCF sim recall<a class="headerlink" href="#usercf-sim" title="Permalink to this heading">¶</a></h4>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># The last click is held out here for recall evaluation</span>
<span class="c1"># Computing user-user similarity in usercf is too memory-hungry, so it was run on sampled data rather than the full dataset</span>
<span class="k">if</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">trn_last_click_df</span> <span class="o">=</span> <span class="n">get_hist_and_last_click</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span> <span class="o">=</span> <span class="n">all_click_df</span>

<span class="n">user_recall_items_dict</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">dict</span><span class="p">)</span>
<span class="n">user_item_time_dict</span> <span class="o">=</span> <span class="n">get_user_item_time</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">)</span>

<span class="n">u2u_sim</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;usercf_u2u_sim.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">))</span>

<span class="n">sim_user_topk</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">recall_item_num</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">item_topk_click</span> <span class="o">=</span> <span class="n">get_item_topk_click</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>

<span class="k">for</span> <span class="n">user</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
    <span class="n">user_recall_items_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_based_recommend</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">user_item_time_dict</span><span class="p">,</span> <span class="n">u2u_sim</span><span class="p">,</span> <span class="n">sim_user_topk</span><span class="p">,</span> \
                                                        <span class="n">recall_item_num</span><span class="p">,</span> <span class="n">item_topk_click</span><span class="p">,</span> <span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">emb_i2i_sim</span><span class="p">)</span>

<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;usercf_u2u2i_recall.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>

<span class="k">if</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="c1"># Evaluate recall quality</span>
    <span class="n">metrics_recall</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="p">,</span> <span class="n">trn_last_click_df</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="n">recall_item_num</span><span class="p">)</span>
</pre></div>
</div>
</section>
<section id="user-embedding-sim">
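<h4><span class="section-number">6.4.4.3.2. </span>user embedding sim recall<a class="headerlink" href="#user-embedding-sim" title="Permalink to this heading">¶</a></h4>
<p>Although the user-to-user similarity was not computed directly with usercf, the user-based collaborative filtering code above can still be validated. The user
embeddings produced by YoutubeDNN are used to retrieve each user's topk most similar users by vector search, and the resulting u2u similarity matrix is then passed to usercf for recall. The code is as follows:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Build the u2u similarity matrix from embeddings</span>
<span class="c1"># topk is the number of most similar users faiss returns for each user</span>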
<span class="k">def</span><span class="w"> </span><span class="nf">u2u_embdding_sim</span><span class="p">(</span><span class="n">click_df</span><span class="p">,</span> <span class="n">user_emb_dict</span><span class="p">,</span> <span class="n">save_path</span><span class="p">,</span> <span class="n">topk</span><span class="p">):</span>

    <span class="n">user_list</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">user_emb_list</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">user_emb</span> <span class="ow">in</span> <span class="n">user_emb_dict</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
        <span class="n">user_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>
        <span class="n">user_emb_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">user_emb</span><span class="p">)</span>

    <span class="n">user_index_2_rawid_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">user_list</span><span class="p">)),</span> <span class="n">user_list</span><span class="p">)}</span>

    <span class="n">user_emb_np</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">user_emb_list</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>

    <span class="c1"># Build the faiss index</span>
    <span class="n">user_index</span> <span class="o">=</span> <span class="n">faiss</span><span class="o">.</span><span class="n">IndexFlatIP</span><span class="p">(</span><span class="n">user_emb_np</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">user_index</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">user_emb_np</span><span class="p">)</span>
    <span class="c1"># Similarity search: for the vector at each index position, return the topk users and their similarities</span>
    <span class="n">sim</span><span class="p">,</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">user_index</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">user_emb_np</span><span class="p">,</span> <span class="n">topk</span><span class="p">)</span> <span class="c1"># both are arrays of shape (n_users, topk)</span>

    <span class="c1"># Map the vector-search results back to raw user ids</span>
    <span class="n">user_sim_dict</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">dict</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">target_idx</span><span class="p">,</span> <span class="n">sim_value_list</span><span class="p">,</span> <span class="n">rele_idx_list</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">user_emb_np</span><span class="p">)),</span> <span class="n">sim</span><span class="p">,</span> <span class="n">idx</span><span class="p">),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
        <span class="n">target_raw_id</span> <span class="o">=</span> <span class="n">user_index_2_rawid_dict</span><span class="p">[</span><span class="n">target_idx</span><span class="p">]</span>
        <span class="c1"># Start from index 1 to drop the user itself, so only topk-1 similar users are kept</span>
        <span class="k">for</span> <span class="n">rele_idx</span><span class="p">,</span> <span class="n">sim_value</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">rele_idx_list</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">sim_value_list</span><span class="p">[</span><span class="mi">1</span><span class="p">:]):</span>
            <span class="n">rele_raw_id</span> <span class="o">=</span> <span class="n">user_index_2_rawid_dict</span><span class="p">[</span><span class="n">rele_idx</span><span class="p">]</span>
            <span class="n">user_sim_dict</span><span class="p">[</span><span class="n">target_raw_id</span><span class="p">][</span><span class="n">rele_raw_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_sim_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">target_raw_id</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">rele_raw_id</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">sim_value</span>

    <span class="c1"># Save the u2u similarity matrix</span>
    <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">user_sim_dict</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;youtube_u2u_sim.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">user_sim_dict</span>
</pre></div>
</div>
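<p>To make the search step more concrete, here is a minimal NumPy sketch of what an exact inner-product top-k search computes. It is a brute-force stand-in for illustration only and does not use faiss; the function name <code>topk_inner_product</code> and the toy vectors are assumptions. One point worth checking: skipping the first result to drop the user itself relies on each vector being its own best match, which holds for L2-normalised vectors but is not guaranteed for unnormalised ones. Whether the YoutubeDNN embeddings here are normalised is not stated in this section, so the sketch normalises explicitly.</p>

```python
import numpy as np


def topk_inner_product(emb, topk):
    """Brute-force top-k inner-product search over the rows of emb (illustrative)."""
    scores = emb @ emb.T                          # (n, n) matrix of inner products
    idx = np.argsort(-scores, axis=1)[:, :topk]   # indices of the topk largest scores per row
    sim = np.take_along_axis(scores, idx, axis=1)
    return sim, idx


# Toy embeddings (assumed data): users 0 and 1 point in similar directions.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype=np.float32)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise each row

sim, idx = topk_inner_product(emb, topk=2)
print(idx[:, 0])  # best match of each user
print(idx[:, 1])  # nearest other user
```

<p>After normalisation, the first column of <code>idx</code> is each user's own index, so slicing from position 1 leaves only the other users.</p>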
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Load the user embeddings produced by YoutubeDNN, then use faiss to compute user-user similarity</span>
<span class="c1"># Note that these user embeddings are not very good: YoutubeDNN trains them from user click sequences,</span>
<span class="c1"># so if the sequences are generally short, the quality suffers</span>
<span class="n">user_emb_dict</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;user_youtube_emb.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">))</span>
<span class="n">u2u_sim</span> <span class="o">=</span> <span class="n">u2u_embdding_sim</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">,</span> <span class="n">user_emb_dict</span><span class="p">,</span> <span class="n">save_path</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</pre></div>
</div>
<p>Recall with the user_embedding obtained from YoutubeDNN:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Use the recall evaluation function to check this recall method</span>
<span class="k">if</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">trn_last_click_df</span> <span class="o">=</span> <span class="n">get_hist_and_last_click</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">trn_hist_click_df</span> <span class="o">=</span> <span class="n">all_click_df</span>

<span class="n">user_recall_items_dict</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">dict</span><span class="p">)</span>
<span class="n">user_item_time_dict</span> <span class="o">=</span> <span class="n">get_user_item_time</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">)</span>
<span class="n">u2u_sim</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;youtube_u2u_sim.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">))</span>

<span class="n">sim_user_topk</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">recall_item_num</span> <span class="o">=</span> <span class="mi">10</span>

<span class="n">item_topk_click</span> <span class="o">=</span> <span class="n">get_item_topk_click</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
<span class="k">for</span> <span class="n">user</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
    <span class="n">user_recall_items_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_based_recommend</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">user_item_time_dict</span><span class="p">,</span> <span class="n">u2u_sim</span><span class="p">,</span> <span class="n">sim_user_topk</span><span class="p">,</span> \
                                                        <span class="n">recall_item_num</span><span class="p">,</span> <span class="n">item_topk_click</span><span class="p">,</span> <span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">emb_i2i_sim</span><span class="p">)</span>

<span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;youtubednn_usercf_recall&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_recall_items_dict</span>
<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;youtubednn_usercf_recall&#39;</span><span class="p">],</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;youtubednn_usercf_recall.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>

<span class="k">if</span> <span class="n">metric_recall</span><span class="p">:</span>
    <span class="c1"># Evaluate recall quality</span>
    <span class="n">metrics_recall</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;youtubednn_usercf_recall&#39;</span><span class="p">],</span> <span class="n">trn_last_click_df</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="n">recall_item_num</span><span class="p">)</span>
</pre></div>
</div>
</section>
</section>
</section>
<section id="id13">
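<h2><span class="section-number">6.4.5. </span>Cold-start problem<a class="headerlink" href="#id13" title="Permalink to this heading">¶</a></h2>
<p><strong>Cold-start problems fall into three types: article cold start, user cold start, and system cold start.</strong></p>
<ul class="simple">
<li><p>Article cold start: an article newly added to the platform has no interaction records, so the question is how to recommend it to users. (In our scenario, any article that never appears in the log data can be treated as a cold-start article.)</p></li>
<li><p>User cold start: a user new to the platform has no article interactions yet, so the question is how to recommend to that user. (In our scenario, this means checking whether a test-set user appears in the log data that corresponds to the test set; if not, the user can be treated as a cold-start user. The criterion need not always be this strict: we can also define our own indicators, such as usage time, click-through rate, or retention rate, to decide which users count as cold-start users.)</p></li>
<li><p>System cold start: the platform has just launched and has no historical data at all. This is effectively a combination of the two cases above.</p></li>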
</ul>
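<p>The first two definitions reduce to set differences between what the log contains and what exists overall. The sketch below is only an illustration under assumed inputs: the function name <code>split_cold_start</code> and the toy ids are hypothetical, and the click log is represented as plain <code>(user_id, article_id)</code> pairs rather than the DataFrame used elsewhere in this chapter.</p>

```python
def split_cold_start(log_clicks, all_article_ids, test_user_ids):
    """Return (cold_items, cold_users) given click-log pairs (illustrative helper).

    log_clicks: iterable of (user_id, article_id) pairs taken from the click log.
    """
    seen_users = {user for user, _ in log_clicks}
    seen_items = {article for _, article in log_clicks}
    cold_items = set(all_article_ids) - seen_items  # articles with no interactions
    cold_users = set(test_user_ids) - seen_users    # users with no interactions
    return cold_items, cold_users


# Toy data (assumed): articles 12 and 13 and user 3 never appear in the log.
log_clicks = [(1, 10), (1, 11), (2, 10)]
cold_items, cold_users = split_cold_start(log_clicks, [10, 11, 12, 13], [1, 2, 3])
print(cold_items, cold_users)
```

<p>A looser definition of a cold-start user, for example one based on a click-count threshold, would replace the second set difference with a filter on per-user counts.</p>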
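<p><strong>Analysis of the cold-start problem in the current scenario:</strong></p>
<p>Analysis of the current data shows that only a little over 30,000 distinct articles are clicked in the logs, while the full article library holds more than 300,000. So will a test-set user's last click fall on an article that never appears in the logs? If it does, the clicked article has no prior interaction data, which is exactly the article cold start described above. The analysis also shows that a sizeable share of test-set users have only one click, and recommending articles with a model from a single click is difficult, so user cold start is worth considering as well. This section, however, gives solutions and code only for item cold start; for user cold start it only outlines some feasible approaches.</p>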
<ol class="arabic simple">
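<li><p>Article cold start (without the cold-start exploration problem)
The aim here is not cold start for its own sake. We suspect that users may click articles that never appear in the log data, so the task is to choose, from the nearly 270,000 such articles, a few to serve as each user's cold-start articles. This can be viewed as another recall strategy, and we use a simple, easy-to-understand rule-based recall to find unseen articles a user is likely to click.
The question becomes: how do we pick a small subset of the 270,000 items for each user? Random selection is one option. Some reference approaches follow.</p>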
<ol class="arabic simple">
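<li><p>First, use Embedding-based recall to retrieve articles similar to the user's history.</p></li>
<li><p>Then filter the Embedding-recalled articles with rules so that the remaining articles are more likely to be clicked. The rules can keep articles that share a topic with the user's previously clicked articles, or whose word count is close to theirs. The remaining articles should also be as close in time as possible to the test-set user's last click, or even from the same day.</p></li>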
</ol>
</li>
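<li><p>User cold start
Analysis of the test-set click data shows that 20% of test-set users have only one click. Could recall for these low-activity users be supplemented with a dedicated strategy? Or could some articles be appended by rule after ranking? Both are worth trying; no concrete implementation is provided here.</p></li>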
</ol>
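<p><strong>Note:</strong></p>
<p>This looks the same as running itemcf on item similarities computed from embeddings, but the goal is different. Here we want items whose vectors are similar and that have not yet appeared in the log, combined with other cold-start strategies. The number of items recalled therefore needs to be somewhat larger; otherwise no articles may be left after filtering.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Run itemcf recall first; no recall evaluation is needed because this is only a strategy</span>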
<span class="n">trn_hist_click_df</span> <span class="o">=</span> <span class="n">all_click_df</span>

<span class="n">user_recall_items_dict</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">dict</span><span class="p">)</span>
<span class="n">user_item_time_dict</span> <span class="o">=</span> <span class="n">get_user_item_time</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">)</span>
<span class="n">i2i_sim</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;emb_i2i_sim.pkl&#39;</span><span class="p">,</span><span class="s1">&#39;rb&#39;</span><span class="p">))</span>

<span class="n">sim_item_topk</span> <span class="o">=</span> <span class="mi">150</span>
<span class="n">recall_item_num</span> <span class="o">=</span> <span class="mi">100</span> <span class="c1"># recall somewhat more articles to leave room for the rule-based filtering</span>

<span class="n">item_topk_click</span> <span class="o">=</span> <span class="n">get_item_topk_click</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
<span class="k">for</span> <span class="n">user</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">trn_hist_click_df</span><span class="p">[</span><span class="s1">&#39;user_id&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
    <span class="n">user_recall_items_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span> <span class="o">=</span> <span class="n">item_based_recommend</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="n">user_item_time_dict</span><span class="p">,</span> <span class="n">i2i_sim</span><span class="p">,</span> <span class="n">sim_item_topk</span><span class="p">,</span>
                                                        <span class="n">recall_item_num</span><span class="p">,</span> <span class="n">item_topk_click</span><span class="p">,</span><span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">emb_i2i_sim</span><span class="p">)</span>
<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;cold_start_items_raw_dict.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">))</span>
</pre></div>
</div>
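<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Filter articles with rules</span>
<span class="c1"># Keep articles whose topic is among the topics in the user history</span>
<span class="c1"># Keep articles whose word count is close to that of the user history</span>
<span class="c1"># Keep articles from around the day of the last click (the code below allows a 90-day gap)</span>
<span class="c1"># Return the final result ordered by similarity</span>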

<span class="k">def</span><span class="w"> </span><span class="nf">get_click_article_ids_set</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">all_click_df</span><span class="o">.</span><span class="n">click_article_id</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>

<span class="k">def</span><span class="w"> </span><span class="nf">cold_start_items</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="p">,</span> <span class="n">user_hist_item_typs_dict</span><span class="p">,</span> <span class="n">user_hist_item_words_dict</span><span class="p">,</span> \
                     <span class="n">user_last_item_created_time_dict</span><span class="p">,</span> <span class="n">item_type_dict</span><span class="p">,</span> <span class="n">item_words_dict</span><span class="p">,</span>
                     <span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">click_article_ids_set</span><span class="p">,</span> <span class="n">recall_item_num</span><span class="p">):</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd">        Recall some articles under cold start</span>
<span class="sd">        :param user_recall_items_dict: dict, many articles recalled by content-embedding similarity, {user1: [(item1, score1), (item2, score2), ..], }</span>
<span class="sd">        :param user_hist_item_typs_dict: dict, topics of the articles each user clicked</span>
<span class="sd">        :param user_hist_item_words_dict: dict, word count of the articles each user clicked</span>
<span class="sd">        :param user_last_item_created_time_dict: dict, creation time of the last article each user clicked</span>
<span class="sd">        :param item_type_dict: dict, article topic</span>
<span class="sd">        :param item_words_dict: dict, article word count</span>
<span class="sd">        :param item_created_time_dict: dict, article creation time</span>
<span class="sd">        :param click_article_ids_set: set, articles that users have clicked, i.e. articles that appear in the log</span>
<span class="sd">        :param recall_item_num: number of articles to recall, counting only articles absent from the log</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="n">cold_start_user_items_dict</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">user</span><span class="p">,</span> <span class="n">item_list</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
        <span class="n">cold_start_user_items_dict</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">user</span><span class="p">,</span> <span class="p">[])</span>
        <span class="k">for</span> <span class="n">item</span><span class="p">,</span> <span class="n">score</span> <span class="ow">in</span> <span class="n">item_list</span><span class="p">:</span>
            <span class="c1"># Information about the user history</span>
            <span class="n">hist_item_type_set</span> <span class="o">=</span> <span class="n">user_hist_item_typs_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span>
            <span class="n">hist_mean_words</span> <span class="o">=</span> <span class="n">user_hist_item_words_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span>
            <span class="n">hist_last_item_created_time</span> <span class="o">=</span> <span class="n">user_last_item_created_time_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span>
            <span class="n">hist_last_item_created_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">fromtimestamp</span><span class="p">(</span><span class="n">hist_last_item_created_time</span><span class="p">)</span>

            <span class="c1"># Information about the current recalled article</span>
            <span class="n">curr_item_type</span> <span class="o">=</span> <span class="n">item_type_dict</span><span class="p">[</span><span class="n">item</span><span class="p">]</span>
            <span class="n">curr_item_words</span> <span class="o">=</span> <span class="n">item_words_dict</span><span class="p">[</span><span class="n">item</span><span class="p">]</span>
            <span class="n">curr_item_created_time</span> <span class="o">=</span> <span class="n">item_created_time_dict</span><span class="p">[</span><span class="n">item</span><span class="p">]</span>
            <span class="n">curr_item_created_time</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">fromtimestamp</span><span class="p">(</span><span class="n">curr_item_created_time</span><span class="p">)</span>

            <span class="c1"># The article must not appear in the click log; it is then filtered by topic, word count, and creation time</span>
            <span class="k">if</span> <span class="n">curr_item_type</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">hist_item_type_set</span> <span class="ow">or</span> \
                <span class="n">item</span> <span class="ow">in</span> <span class="n">click_article_ids_set</span> <span class="ow">or</span> \
                <span class="nb">abs</span><span class="p">(</span><span class="n">curr_item_words</span> <span class="o">-</span> <span class="n">hist_mean_words</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">200</span> <span class="ow">or</span> \
                <span class="nb">abs</span><span class="p">((</span><span class="n">curr_item_created_time</span> <span class="o">-</span> <span class="n">hist_last_item_created_time</span><span class="p">)</span><span class="o">.</span><span class="n">days</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">90</span><span class="p">:</span>
                <span class="k">continue</span>

            <span class="n">cold_start_user_items_dict</span><span class="p">[</span><span class="n">user</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">item</span><span class="p">,</span> <span class="n">score</span><span class="p">))</span>      <span class="c1"># {user1: [(item1, score1), (item2, score2)..]...}</span>

    <span class="c1"># Cap the number of cold-start articles recalled per user</span>
    <span class="n">cold_start_user_items_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span><span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">recall_item_num</span><span class="p">]</span> \
                                  <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">cold_start_user_items_dict</span><span class="o">.</span><span class="n">items</span><span class="p">()}</span>

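    <span class="c1"># Use a context manager so the file handle is closed after writing</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">save_path</span> <span class="o">/</span> <span class="s1">&#39;cold_start_user_items_dict.pkl&#39;</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">cold_start_user_items_dict</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>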

    <span class="k">return</span> <span class="n">cold_start_user_items_dict</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">all_click_df_</span> <span class="o">=</span> <span class="n">all_click_df</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">all_click_df_</span> <span class="o">=</span> <span class="n">all_click_df_</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">item_info_df</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;left&#39;</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">&#39;click_article_id&#39;</span><span class="p">)</span>
<span class="n">user_hist_item_typs_dict</span><span class="p">,</span> <span class="n">user_hist_item_ids_dict</span><span class="p">,</span> <span class="n">user_hist_item_words_dict</span><span class="p">,</span> <span class="n">user_last_item_created_time_dict</span> <span class="o">=</span> <span class="n">get_user_hist_item_info_dict</span><span class="p">(</span><span class="n">all_click_df_</span><span class="p">)</span>
<span class="n">click_article_ids_set</span> <span class="o">=</span> <span class="n">get_click_article_ids_set</span><span class="p">(</span><span class="n">all_click_df</span><span class="p">)</span>
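<span class="c1"># Note: many rules are used here to filter the cold-start articles, so the earlier recall stage</span>
<span class="c1"># should recall as many articles as possible; otherwise most candidates are easily filtered out</span>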
<span class="n">cold_start_user_items_dict</span> <span class="o">=</span> <span class="n">cold_start_items</span><span class="p">(</span><span class="n">user_recall_items_dict</span><span class="p">,</span> <span class="n">user_hist_item_typs_dict</span><span class="p">,</span> <span class="n">user_hist_item_words_dict</span><span class="p">,</span> \
                                              <span class="n">user_last_item_created_time_dict</span><span class="p">,</span> <span class="n">item_type_dict</span><span class="p">,</span> <span class="n">item_words_dict</span><span class="p">,</span> \
                                              <span class="n">item_created_time_dict</span><span class="p">,</span> <span class="n">click_article_ids_set</span><span class="p">,</span> <span class="n">recall_item_num</span><span class="p">)</span>

<span class="n">user_multi_recall_dict</span><span class="p">[</span><span class="s1">&#39;cold_start_recall&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">cold_start_user_items_dict</span>
</pre></div>
</div>
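<p>The rule-based filtering inside <code>cold_start_items</code> can be read as a single predicate over one candidate article. The following is a minimal sketch rather than the tutorial's own code: the function name <code>passes_cold_start_rules</code> and its parameter names are hypothetical, the two thresholds (200 words, 90 days) are taken from the code above, and the category and already-clicked checks are assumptions inferred from the arguments passed to <code>cold_start_items</code>.</p>

```python
from datetime import datetime


def passes_cold_start_rules(item_id, item_type, item_words, item_created,
                            hist_types, hist_mean_words, hist_last_created,
                            clicked_items, max_word_gap=200, max_day_gap=90):
    """Return True if a candidate article survives every cold-start filter.

    Hypothetical helper for illustration only.
    """
    # Assumed rule: the article's category must appear in the user's history.
    if item_type not in hist_types:
        return False
    # Assumed rule: the article must not be in the set of already-clicked articles.
    if item_id in clicked_items:
        return False
    # From the code above: word count must be within 200 of the user's historical mean.
    if abs(item_words - hist_mean_words) > max_word_gap:
        return False
    # From the code above: creation time must be within 90 days of the
    # creation time of the user's last clicked article.
    if abs((item_created - hist_last_created).days) > max_day_gap:
        return False
    return True


# Usage with made-up values
candidate_ok = passes_cold_start_rules(
    item_id=7, item_type=3, item_words=250, item_created=datetime(2020, 3, 1),
    hist_types={3, 5}, hist_mean_words=200, hist_last_created=datetime(2020, 1, 1),
    clicked_items={1, 2},
)
```

<p>Because every rule can reject a candidate, the stricter the thresholds, the more candidates the upstream recall step needs to supply.</p>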
</section>
<section id="id14">
<h2><span class="section-number">6.4.6. </span>Merging multi-channel recall results<a class="headerlink" href="#id14" title="Permalink to this heading">¶</a></h2>
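<p>Merging combines the per-user article lists produced by all of the recall strategies above into a single candidate list. The recall channels being merged are:</p>
<ol class="arabic simple">
<li><p>Recall based on the item-to-item similarity computed by itemCF</p></li>
<li><p>Recall based on the item-to-item similarity obtained from embedding search</p></li>
<li><p>YoutubeDNN recall</p></li>
<li><p>Recall based on the user-to-user similarity derived from YoutubeDNN</p></li>
<li><p>Recall based on the cold-start strategy</p></li>
</ol>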
<div class="line-block">
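<div class="line"><strong>Note:</strong></div>
<div class="line">Recall evaluation shows that some channels perform well while others perform poorly, so we can manually assign a weight to each channel and use these weights when fusing the final similarity scores.</div>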
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">combine_recall_results</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">,</span> <span class="n">weight_dict</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">25</span><span class="p">):</span>
    <span class="n">final_recall_items_dict</span> <span class="o">=</span> <span class="p">{}</span>

    <span class="c1"># Normalize each channel&#39;s scores per user so that scores for the same user and item can be summed across channels</span>
    <span class="k">def</span><span class="w"> </span><span class="nf">norm_user_recall_items_sim</span><span class="p">(</span><span class="n">sorted_item_list</span><span class="p">):</span>
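        <span class="c1"># If the list is empty or holds a single article, return it unchanged (a single item keeps its raw score).</span>
        <span class="c1"># This can happen when cold-start recall returns too few articles and the rule-based filtering removes them; other fallback strategies could be applied here</span>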
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sorted_item_list</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">sorted_item_list</span>

        <span class="n">min_sim</span> <span class="o">=</span> <span class="n">sorted_item_list</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">max_sim</span> <span class="o">=</span> <span class="n">sorted_item_list</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>

        <span class="n">norm_sorted_item_list</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">item</span><span class="p">,</span> <span class="n">score</span> <span class="ow">in</span> <span class="n">sorted_item_list</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">max_sim</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">norm_score</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">*</span> <span class="p">(</span><span class="n">score</span> <span class="o">-</span> <span class="n">min_sim</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">max_sim</span> <span class="o">-</span> <span class="n">min_sim</span><span class="p">)</span> <span class="k">if</span> <span class="n">max_sim</span> <span class="o">&gt;</span> <span class="n">min_sim</span> <span class="k">else</span> <span class="mf">1.0</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">norm_score</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="n">norm_sorted_item_list</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">item</span><span class="p">,</span> <span class="n">norm_score</span><span class="p">))</span>

        <span class="k">return</span> <span class="n">norm_sorted_item_list</span>

    <span class="nb">print</span><span class="p">(</span><span class="s1">&#39;Merging multi-channel recall results...&#39;</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">method</span><span class="p">,</span> <span class="n">user_recall_items</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="n">logger</span><span class="o">.</span><span class="n">isEnabledFor</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)):</span>
        <span class="nb">print</span><span class="p">(</span><span class="n">method</span> <span class="o">+</span> <span class="s1">&#39;...&#39;</span><span class="p">)</span>
        <span class="c1"># Each channel can be given its own weight when computing the final recall result</span>
        <span class="k">if</span> <span class="n">weight_dict</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
            <span class="n">recall_method_weight</span> <span class="o">=</span> <span class="mi">1</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">recall_method_weight</span> <span class="o">=</span> <span class="n">weight_dict</span><span class="p">[</span><span class="n">method</span><span class="p">]</span>

        <span class="k">for</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">sorted_item_list</span> <span class="ow">in</span> <span class="n">user_recall_items</span><span class="o">.</span><span class="n">items</span><span class="p">():</span> <span class="c1"># normalize scores (overwrites the input dict in place)</span>
            <span class="n">user_recall_items</span><span class="p">[</span><span class="n">user_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">norm_user_recall_items_sim</span><span class="p">(</span><span class="n">sorted_item_list</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">sorted_item_list</span> <span class="ow">in</span> <span class="n">user_recall_items</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
            <span class="c1"># print(&#39;user_id&#39;)</span>
            <span class="n">final_recall_items_dict</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="p">{})</span>
            <span class="k">for</span> <span class="n">item</span><span class="p">,</span> <span class="n">score</span> <span class="ow">in</span> <span class="n">sorted_item_list</span><span class="p">:</span>
                <span class="n">final_recall_items_dict</span><span class="p">[</span><span class="n">user_id</span><span class="p">]</span><span class="o">.</span><span class="n">setdefault</span><span class="p">(</span><span class="n">item</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
                <span class="n">final_recall_items_dict</span><span class="p">[</span><span class="n">user_id</span><span class="p">][</span><span class="n">item</span><span class="p">]</span> <span class="o">+=</span> <span class="n">recall_method_weight</span> <span class="o">*</span> <span class="n">score</span>

    <span class="n">final_recall_items_dict_rank</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="c1"># The final number of recalled items per user can also be capped here</span>
    <span class="k">for</span> <span class="n">user</span><span class="p">,</span> <span class="n">recall_item_dict</span> <span class="ow">in</span> <span class="n">final_recall_items_dict</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
        <span class="n">final_recall_items_dict_rank</span><span class="p">[</span><span class="n">user</span><span class="p">]</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">recall_item_dict</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">topk</span><span class="p">]</span>

    <span class="c1"># Save the merged recall scores locally (the full per-user score dict, before sorting and top-k truncation)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">save_path</span><span class="p">,</span> <span class="s1">&#39;final_recall_items_dict.pkl&#39;</span><span class="p">),</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">final_recall_items_dict</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">final_recall_items_dict_rank</span>
</pre></div>
</div>
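<p>To make the per-user normalization and weighted summation concrete, here is a small self-contained sketch on toy data. It is a simplified stand-in for <code>combine_recall_results</code>, not a replacement: the names <code>min_max_normalize</code> and <code>merge_channels</code> are hypothetical, the scores are made up, and pickling, logging, and progress bars are left out.</p>

```python
def min_max_normalize(sorted_items):
    """Scale the scores of a descending-sorted [(item, score), ...] list into [0, 1]."""
    if len(sorted_items) < 2:
        return list(sorted_items)
    max_s, min_s = sorted_items[0][1], sorted_items[-1][1]
    if max_s == min_s:
        # All scores are equal, so every item gets the top score.
        return [(item, 1.0) for item, _ in sorted_items]
    return [(item, (s - min_s) / (max_s - min_s)) for item, s in sorted_items]


def merge_channels(channel_results, weights=None, topk=25):
    """Sum weighted, normalized scores per user and item, then keep the top-k items."""
    merged = {}
    for method, user_items in channel_results.items():
        w = 1.0 if weights is None else weights.get(method, 1.0)
        for user, items in user_items.items():
            user_scores = merged.setdefault(user, {})
            for item, score in min_max_normalize(items):
                user_scores[item] = user_scores.get(item, 0.0) + w * score
    return {
        user: sorted(scores.items(), key=lambda x: x[1], reverse=True)[:topk]
        for user, scores in merged.items()
    }


# Toy input: two channels whose raw scores live on very different scales.
toy_recall = {
    "itemcf": {"u1": [("a", 4.0), ("b", 3.0), ("c", 0.0)]},
    "embedding": {"u1": [("b", 30.0), ("d", 10.0)]},
}
toy_result = merge_channels(toy_recall, weights={"itemcf": 1.0, "embedding": 0.5}, topk=2)
print(toy_result)  # {'u1': [('b', 1.25), ('a', 1.0)]}
```

<p>Article <code>b</code> is recalled by both channels, so its normalized scores add up (0.75 + 0.5 × 1.0) and it overtakes <code>a</code>, which is the top item of only one channel. Without normalization the embedding channel's larger raw scores would dominate the sum.</p>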
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># All channels get the same weight here; in practice the values can be tuned according to how each channel performed in the recall evaluation</span>
<span class="n">weight_dict</span> <span class="o">=</span> <span class="p">{</span><span class="s1">&#39;itemcf_sim_itemcf_recall&#39;</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
               <span class="s1">&#39;embedding_sim_item_recall&#39;</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
               <span class="s1">&#39;youtubednn_recall&#39;</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
               <span class="s1">&#39;youtubednn_usercf_recall&#39;</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
               <span class="s1">&#39;cold_start_recall&#39;</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">}</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># After merging, keep 150 recalled articles per user as candidates for the ranking stage</span>
<span class="n">final_recall_items_dict_rank</span> <span class="o">=</span> <span class="n">combine_recall_results</span><span class="p">(</span><span class="n">user_multi_recall_dict</span><span class="p">,</span> <span class="n">weight_dict</span><span class="p">,</span> <span class="n">topk</span><span class="o">=</span><span class="mi">150</span><span class="p">)</span>
</pre></div>
</div>
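<p>The weights above are all 1.0, but the accompanying comment notes that they can be tuned from the recall evaluation. One simple option, shown here only as a sketch, is to scale each channel's weight by its hit rate relative to the best channel. The helper <code>weights_from_hit_rates</code>, the <code>floor</code> parameter, and the hit-rate numbers are all hypothetical placeholders, not measured results.</p>

```python
def weights_from_hit_rates(hit_rates, floor=0.1):
    """Map per-channel hit rates to weights in [floor, 1.0]; the best channel gets 1.0.

    Hypothetical helper: `hit_rates` is assumed to be {channel_name: hit_rate}.
    """
    best = max(hit_rates.values())
    if best <= 0:
        # No channel hit anything, so fall back to equal weights.
        return {method: 1.0 for method in hit_rates}
    # The floor keeps weak channels from being silenced completely.
    return {method: max(floor, rate / best) for method, rate in hit_rates.items()}


# Placeholder hit rates for illustration only (not measured on the dataset).
example_hit_rates = {
    "itemcf_sim_itemcf_recall": 0.4,
    "embedding_sim_item_recall": 0.2,
    "cold_start_recall": 0.0,
}
weight_dict_sketch = weights_from_hit_rates(example_hit_rates)
```

<p>The resulting dictionary has the same shape as <code>weight_dict</code> and could be passed to <code>combine_recall_results</code> in its place, provided it contains a key for every channel in <code>user_multi_recall_dict</code>.</p>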
</section>
<section id="id15">
<h2><span class="section-number">6.4.7. </span>Summary<a class="headerlink" href="#id15" title="Permalink to this heading">¶</a></h2>
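<p>This section implemented the following recall strategies:</p>
<ol class="arabic simple">
<li><p>itemCF based on association rules</p></li>
<li><p>userCF based on association rules</p></li>
<li><p>YoutubeDNN recall</p></li>
<li><p>Cold-start recall</p></li>
</ol>
<p>None of these strategies has been tuned to an optimal result; they are only a simple first attempt, and there is plenty of room for improvement, for example by tuning the parameters of the existing strategies, adding new association rules, or modifying the existing ones. More recall strategies can also be tried, such as popularity-based recall of news articles.</p>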
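<p>As an illustration of the popularity-based recall suggested above, the following sketch recommends the most-clicked articles that a user has not seen yet. It is an assumption-laden example, not part of the tutorial's code: the function <code>popularity_recall</code> is hypothetical, the click log is represented as plain <code>(user_id, article_id)</code> tuples instead of a DataFrame, and the click count is used directly as the score.</p>

```python
from collections import Counter


def popularity_recall(click_log, topk=10):
    """Recall the globally most-clicked articles that each user has not clicked.

    click_log: iterable of (user_id, article_id) tuples (assumed format).
    Returns {user_id: [(article_id, score), ...]} sorted by descending score.
    """
    click_log = list(click_log)
    counts = Counter(article for _, article in click_log)
    ranked = counts.most_common()  # [(article_id, click_count), ...], most clicked first

    # Collect the set of articles each user has already clicked.
    seen_by_user = {}
    for user, article in click_log:
        seen_by_user.setdefault(user, set()).add(article)

    recall = {}
    for user, seen in seen_by_user.items():
        unseen = [(article, float(count)) for article, count in ranked if article not in seen]
        recall[user] = unseen[:topk]
    return recall


# Toy click log for illustration
toy_clicks = [("u1", "a"), ("u2", "a"), ("u2", "b"),
              ("u3", "a"), ("u3", "b"), ("u3", "c")]
hot_recall = popularity_recall(toy_clicks, topk=2)
print(hot_recall)  # {'u1': [('b', 2.0), ('c', 1.0)], 'u2': [('c', 1.0)], 'u3': []}
```

<p>The output uses the same <code>{user: [(item, score), ...]}</code> layout as the other channels, so it could be stored under a new key in <code>user_multi_recall_dict</code>, with a matching entry in <code>weight_dict</code>, and merged like the rest.</p>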
</section>
</section>


        </div>
        <div class="side-doc-outline">
            <div class="side-doc-outline--content"> 
<div class="localtoc">
    <p class="caption">
      <span class="caption-text">Table Of Contents</span>
    </p>
    <ul>
<li><a class="reference internal" href="#">6.4. Multi-channel recall</a><ul>
<li><a class="reference internal" href="#id2">6.4.1. Loading the data</a></li>
<li><a class="reference internal" href="#id3">6.4.2. Utility functions</a><ul>
<li><a class="reference internal" href="#id4">6.4.2.1. Getting user-article-time mappings</a></li>
<li><a class="reference internal" href="#id5">6.4.2.2. Getting article-user-time mappings</a></li>
<li><a class="reference internal" href="#id6">6.4.2.3. Getting the click history and the last click</a></li>
<li><a class="reference internal" href="#id7">6.4.2.4. Getting article attribute features</a></li>
<li><a class="reference internal" href="#id8">6.4.2.5. Getting information on a user's historically clicked articles</a></li>
<li><a class="reference internal" href="#top-k">6.4.2.6. Getting the top-k most-clicked articles</a></li>
<li><a class="reference internal" href="#id9">6.4.2.7. Defining the multi-channel recall dictionary</a></li>
<li><a class="reference internal" href="#id10">6.4.2.8. Evaluating recall performance</a></li>
</ul>
</li>
<li><a class="reference internal" href="#id11">6.4.3. Computing similarity matrices</a><ul>
<li><a class="reference internal" href="#itemcf-i2i-sim">6.4.3.1. itemCF i2i_sim</a></li>
<li><a class="reference internal" href="#usercf-u2u-sim">6.4.3.2. userCF u2u_sim</a></li>
<li><a class="reference internal" href="#item-embedding-sim">6.4.3.3. item embedding sim</a></li>
</ul>
</li>
<li><a class="reference internal" href="#id12">6.4.4. Recall</a><ul>
<li><a class="reference internal" href="#youtubednn">6.4.4.1. YoutubeDNN recall</a></li>
<li><a class="reference internal" href="#itemcf-recall">6.4.4.2. itemCF recall</a><ul>
<li><a class="reference internal" href="#itemcf-sim">6.4.4.2.1. itemCF sim recall</a></li>
<li><a class="reference internal" href="#embedding-sim">6.4.4.2.2. embedding sim recall</a></li>
</ul>
</li>
<li><a class="reference internal" href="#usercf">6.4.4.3. userCF recall</a><ul>
<li><a class="reference internal" href="#usercf-sim">6.4.4.3.1. userCF sim recall</a></li>
<li><a class="reference internal" href="#user-embedding-sim">6.4.4.3.2. user embedding sim recall</a></li>
</ul>
</li>
</ul>
</li>
<li><a class="reference internal" href="#id13">6.4.5. The cold-start problem</a></li>
<li><a class="reference internal" href="#id14">6.4.6. Merging multi-channel recall results</a></li>
<li><a class="reference internal" href="#id15">6.4.7. Summary</a></li>
</ul>
</li>
</ul>

</div>
            </div>
        </div>

      <div class="clearer"></div>
    </div><div class="pagenation">
     <a id="button-prev" href="3.analysis.html" class="mdl-button mdl-js-button mdl-js-ripple-effect mdl-button--colored" role="botton" accesskey="P">
         <i class="pagenation-arrow-L fas fa-arrow-left fa-lg"></i>
         <div class="pagenation-text">
            <span class="pagenation-direction">Previous</span>
            <div>6.3. Data analysis</div>
         </div>
     </a>
     <a id="button-next" href="5.feature_engineering.html" class="mdl-button mdl-js-button mdl-js-ripple-effect mdl-button--colored" role="botton" accesskey="N">
         <i class="pagenation-arrow-R fas fa-arrow-right fa-lg"></i>
        <div class="pagenation-text">
            <span class="pagenation-direction">Next</span>
            <div>6.5. Feature engineering</div>
        </div>
     </a>
  </div>
        
        </main>
    </div>
  </body>
</html>